System and method for cleaning noisy genetic data and determining chromosome copy number

ABSTRACT

Disclosed herein is a system and method for increasing the fidelity of measured genetic data, for making allele calls, and for determining the state of aneuploidy, in one or a small set of cells, or from fragmentary DNA, where a limited quantity of genetic data is available. Poorly or incorrectly measured base pairs, missing alleles and missing regions are reconstructed using expected similarities between the target genome and the genome of genetically related individuals. In accordance with one embodiment, incomplete genetic data from an embryonic cell are reconstructed at a plurality of loci using the more complete genetic data from a larger sample of diploid cells from one or both parents, with or without haploid genetic data from one or both parents. In another embodiment, the chromosome copy number can be determined from the measured genetic data, with or without genetic information from one or both parents.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation-in-part of U.S. application Ser. No. 16/411,507 filed May 14, 2019, and a continuation-in-part of U.S. application Ser. No. 16/399,911 filed Apr. 30, 2019. U.S. application Ser. No. 16/411,507 is a continuation of U.S. application Ser. No. 16/288,690 filed Feb. 28, 2019, which is a continuation of U.S. application Ser. No. 15/187,555 filed Jun. 20, 2016, now U.S. Pat. No. 10,227,652, which is a continuation of U.S. application Ser. No. 14/092,457 filed Nov. 27, 2013, now U.S. Pat. No. 9,430,611. U.S. application Ser. No. 14/092,457 is a continuation of U.S. application Ser. No. 13/793,133 filed Mar. 11, 2013, now U.S. Pat. No. 9,424,392, and a continuation of U.S. application Ser. No. 13/793,186 filed Mar. 11, 2013, now U.S. Pat. No. 8,682,592. Each of U.S. application Ser. No. 13/793,133 and U.S. application Ser. No. 13/793,186 is a continuation of U.S. application Ser. No. 11/603,406 filed Nov. 22, 2006, now U.S. Pat. No. 8,532,930, which claims the benefit of U.S. Prov. Appl. No. 60/739,882 filed Nov. 26, 2005; U.S. Prov. Appl. No. 60/742,305 filed Dec. 6, 2005; U.S. Prov. Appl. No. 60/754,396 filed Dec. 29, 2005; U.S. Prov. Appl. No. 60/774,976 filed Feb. 21, 2006; U.S. Prov. Appl. No. 60/789,506 filed Apr. 4, 2006; U.S. Prov. Appl. No. 60/817,741 filed Jun. 30, 2006; and U.S. Prov. Appl. No. 60/846,610 filed Sep. 22, 2006. U.S. application Ser. No. 16/399,911 is a continuation of U.S. application Ser. No. 15/887,746 filed Feb. 2, 2018, which is a continuation of U.S. application Ser. No. 15/446,778 filed Mar. 1, 2017, now U.S. Pat. No. 10,260,096, which is a continuation of U.S. application Ser. No. 13/949,212 filed Jul. 23, 2013, now U.S. Pat. No. 10,083,273, which is a continuation of U.S. application Ser. No. 12/076,348 filed Mar. 17, 2008, now U.S. Pat. No. 8,515,679. U.S. application Ser. No. 12/076,348 is a continuation-in-part of U.S. application Ser. No. 11/496,982 filed Jul. 31, 2006, now abandoned; a continuation-in-part of U.S. application Ser. No. 11/603,406 filed Nov. 22, 2006, now U.S. Pat. No. 8,532,930; and a continuation-in-part of U.S. application Ser. No. 11/634,550 filed Dec. 6, 2006, now abandoned; and claims the benefit of U.S. Prov. Appl. No. 60/918,292 filed Mar. 16, 2007; U.S. Prov. Appl. No. 60/926,198 filed Apr. 25, 2007; U.S. Prov. Appl. No. 60/932,456 filed May 31, 2007; U.S. Prov. Appl. No. 60/934,440 filed Jun. 13, 2007; U.S. Prov. Appl. No. 61/003,101 filed Nov. 13, 2007; and U.S. Prov. Appl. No. 61/008,637 filed Dec. 21, 2007. U.S. application Ser. No. 11/634,550 claims the benefit of U.S. Prov. Appl. No. 60/742,305 filed Dec. 6, 2005; U.S. Prov. Appl. No. 60/754,396 filed Dec. 29, 2005; U.S. Prov. Appl. No. 60/774,976 filed Feb. 21, 2006; U.S. Prov. Appl. No. 60/789,506 filed Apr. 4, 2006; U.S. Prov. Appl. No. 60/817,741 filed Jun. 30, 2006; and U.S. Prov. Appl. No. 60/846,610 filed Sep. 22, 2006. U.S. application Ser. No. 11/603,406 is a continuation-in-part of U.S. application Ser. No. 11/496,982 filed Jul. 31, 2006; and also claims the benefit of U.S. Prov. Appl. No. 60/739,882 filed Nov. 26, 2005; U.S. Prov. Appl. No. 60/742,305 filed Dec. 6, 2005; U.S. Prov. Appl. No. 60/754,396 filed Dec. 29, 2005; U.S. Prov. Appl. No. 60/774,976 filed Feb. 21, 2006; U.S. Prov. Appl. No. 60/789,506 filed Apr. 4, 2006; U.S. Prov. Appl. No. 60/817,741 filed Jun. 30, 2006; and U.S. Prov. Appl. No. 60/846,610 filed Sep. 22, 2006. U.S. application Ser. No. 11/496,982 claims the benefit of U.S. Prov. Appl. No. 60/703,415 filed Jul. 29, 2005; U.S. Prov. Appl. No. 60/742,305 filed Dec. 6, 2005; U.S. Prov. Appl. No. 60/754,396 filed Dec. 29, 2005; U.S. Prov. Appl. No. 60/774,976 filed Feb. 21, 2006; U.S. Prov. Appl. No. 60/789,506 filed Apr. 4, 2006; and U.S. Prov. Appl. No. 60/817,741 filed Jun. 30, 2006. Each of the applications cited above is hereby incorporated by reference in its entirety.

Government Licensing Rights

This invention was made with government support under Grant No. R44HD054958-02A2 awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND OF THE INVENTION

Field of the Invention

The invention relates generally to the field of acquiring, manipulating and using genetic data for medically predictive purposes, and specifically to a system in which imperfectly measured genetic data of a target individual are made more accurate by using known genetic data of genetically related individuals, thereby allowing more effective identification of genetic variations, specifically aneuploidy and disease-linked genes, that could result in various phenotypic outcomes.

Description of the Related Art

In 2006, across the globe, roughly 800,000 in vitro fertilization (IVF) cycles were run. Of the roughly 150,000 cycles run in the US, about 10,000 involved pre-implantation genetic diagnosis (PGD). Current PGD techniques are unregulated, expensive and highly unreliable: error rates for screening disease-linked loci or aneuploidy are on the order of 10%, each screening test costs roughly $5,000, and a couple is forced to choose between testing for aneuploidy, which afflicts roughly 50% of IVF embryos, and screening for disease-linked loci on the single cell. There is a great need for an affordable technology that can reliably determine genetic data from a single cell in order to screen in parallel for aneuploidy, monogenic diseases such as Cystic Fibrosis, and susceptibility to complex disease phenotypes for which the multiple genetic markers are known through whole-genome association studies.

Most PGD today focuses on high-level chromosomal abnormalities such as aneuploidy and balanced translocations, with the primary outcomes being successful implantation and a take-home baby. The other main focus of PGD is genetic disease screening, with the primary outcome being a healthy baby not afflicted with a genetically heritable disease for which one or both parents are carriers. In both cases, the likelihood of the desired outcome is enhanced by excluding genetically suboptimal embryos from transfer and implantation in the mother.

The process of PGD during IVF currently involves extracting a single cell from the roughly eight cells of an early-stage embryo for analysis. Isolation of single cells from human embryos, while highly technical, is now routine in IVF clinics. Both polar bodies and blastomeres have been isolated with success. The most common technique is to remove single blastomeres from day 3 embryos (6 or 8 cell stage). Embryos are transferred to a special cell culture medium (standard culture medium lacking calcium and magnesium), and a hole is introduced into the zona pellucida using an acidic solution, laser, or mechanical techniques. The technician then uses a biopsy pipette to remove a single blastomere with a visible nucleus. Features of the DNA of the single (or occasionally multiple) blastomere are measured using a variety of techniques. Since only a single copy of the DNA is available from one cell, direct measurements of the DNA are highly error-prone, or noisy. There is a great need for a technique that can correct, or make more accurate, these noisy genetic measurements.

Normal humans have two sets of 23 chromosomes in every diploid cell, with one copy coming from each parent. Aneuploidy, the state of a cell with extra or missing chromosome(s), and uniparental disomy, the state of a cell with two of a given chromosome both of which originate from one parent, are believed to be responsible for a large percentage of failed implantations and miscarriages, and some genetic diseases. When only certain cells in an individual are aneuploid, the individual is said to exhibit mosaicism. Detection of chromosomal abnormalities can identify individuals or embryos with conditions such as Down syndrome, Klinefelter's syndrome, and Turner syndrome, among others, in addition to increasing the chances of a successful pregnancy. Testing for chromosomal abnormalities is especially important as the age of a potential mother increases: between the ages of 35 and 40 it is estimated that between 40% and 50% of the embryos are abnormal, and above the age of 40, more than half of the embryos are likely to be abnormal. The main cause of aneuploidy is nondisjunction during meiosis. Maternal nondisjunction constitutes 88% of all nondisjunction, of which 65% occurs in meiosis I and 23% in meiosis II. Common types of human aneuploidy include trisomy from meiosis I nondisjunction, monosomy, and uniparental disomy. In a particular type of trisomy that arises in meiosis II nondisjunction, or M2 trisomy, an extra chromosome is identical to one of the two normal chromosomes. M2 trisomy is particularly difficult to detect. There is a great need for a better method that can detect many or all types of aneuploidy at most or all of the chromosomes efficiently and with high accuracy.

Karyotyping, the traditional method used for the prediction of aneuploidy and mosaicism, is giving way to other more high-throughput, more cost-effective methods such as flow cytometry (FC) and fluorescent in situ hybridization (FISH). Currently, the vast majority of prenatal diagnoses use FISH, which can determine large chromosomal aberrations, and PCR/electrophoresis, which can determine a handful of SNPs or other allele calls. One advantage of FISH is that it is less expensive than karyotyping, but the technique is complex and expensive enough that generally only a small selection of chromosomes is tested (usually chromosomes 13, 18, 21, X, Y; also sometimes 8, 9, 15, 16, 17, 22); in addition, FISH has a low level of specificity. Roughly seventy-five percent of PGD today measures high-level chromosomal abnormalities such as aneuploidy using FISH, with error rates on the order of 10-15%. There is a great demand for an aneuploidy screening method that has a higher throughput, lower cost, and greater accuracy.

The number of known disease-associated genetic alleles is currently at 389 according to OMIM and steadily climbing. Consequently, it is becoming increasingly relevant to analyze multiple positions on the embryonic DNA, or loci, that are associated with particular phenotypes. A clear advantage of pre-implantation genetic diagnosis over prenatal diagnosis is that it avoids some of the ethical issues regarding possible choices of action once undesirable phenotypes have been detected. A need exists for a method for more extensive genotyping of embryos at the pre-implantation stage.

There are a number of advanced technologies that enable the diagnosis of genetic aberrations at one or a few loci at the single-cell level. These include interphase chromosome conversion, comparative genomic hybridization, fluorescent PCR, mini-sequencing and whole genome amplification. The reliability of the data generated by all of these techniques relies on the quality of the DNA preparation. Better methods for the preparation of single-cell DNA for amplification and PGD are therefore needed and are under study. All genotyping techniques, when used on single cells, small numbers of cells, or fragments of DNA, suffer from integrity issues, most notably allele drop-out (ADO). This is exacerbated in the context of in vitro fertilization since the efficiency of the hybridization reaction is low, and the technique must operate quickly in order to genotype the embryo within the time period of maximal embryo viability. There exists a great need for a method that alleviates the problem of a high ADO rate when measuring genetic data from one or a small number of cells, especially when time constraints exist.

Listed here is a set of prior art which is related to the field of the current invention. None of this prior art contains or in any way refers to the novel elements of the current invention. In U.S. Pat. No. 6,489,135 Parrott et al. provide methods for determining various biological characteristics of in vitro fertilized embryos, including overall embryo health, implantability, and increased likelihood of developing successfully to term, by analyzing media specimens of in vitro fertilization cultures for levels of bioactive lipids. In US Patent Application 20040033596 Threadgill et al. describe a method for preparing homozygous cellular libraries useful for in vitro phenotyping and gene mapping involving site-specific mitotic recombination in a plurality of isolated parent cells. In U.S. Pat. No. 5,635,366 Cooke et al. provide a method for predicting the outcome of IVF by determining the level of 11β-hydroxysteroid dehydrogenase (11β-HSD) in a biological sample from a female patient. In U.S. Pat. No. 7,058,517 Denton et al. describe a method wherein an individual's haplotypes are compared to a known database of haplotypes in the general population to predict clinical response to a treatment. In U.S. Pat. No. 7,035,739 Schadt et al. describe a method wherein a genetic marker map is constructed and the individual genes and traits are analyzed to give gene-trait locus data, which are then clustered as a way to identify genetically interacting pathways, which are validated using multivariate analysis. In US Patent Application US 2004/0137470 A1, Dhallan et al. describe using primers especially selected so as to improve the amplification rate, and detection of, a large number of pertinent disease-related loci, and a method of more efficiently quantitating the absence, presence and/or amount of each of those genes. In World Patent Application WO 03/031646, Findlay et al. describe a method to use an improved selection of genetic markers such that amplification of the limited amount of genetic material will give more uniformly amplified material, and it can be genotyped with higher fidelity.

Current methods of prenatal diagnosis can alert physicians and parents to abnormalities in growing fetuses. Without prenatal diagnosis, one in 50 babies is born with serious physical or mental handicap, and as many as one in 30 will have some form of congenital malformation. Unfortunately, standard methods require invasive testing and carry a roughly 1 percent risk of miscarriage. These methods include amniocentesis, chorion villus biopsy and fetal blood sampling. Of these, amniocentesis is the most common procedure; in 2003, it was performed in approximately 3% of all pregnancies, though its frequency of use has been decreasing over the past decade and a half. A major drawback of prenatal diagnosis is that, given the limited courses of action once an abnormality has been detected, it is only valuable and ethical to test for very serious defects. As a result, prenatal diagnosis is typically only attempted in cases of high-risk pregnancies, where the elevated chance of a defect combined with the seriousness of the potential abnormality outweighs the risks. A need exists for a method of prenatal diagnosis that mitigates these risks.

It has recently been discovered that cell-free fetal DNA and intact fetal cells can enter maternal blood circulation. Consequently, analysis of these cells can allow early Non-Invasive Prenatal Genetic Diagnosis (NIPGD). A key challenge in using NIPGD is the task of identifying and extracting fetal cells or nucleic acids from the mother's blood. The fetal cell concentration in maternal blood depends on the stage of pregnancy and the condition of the fetus, but estimates range from one to forty fetal cells in every milliliter of maternal blood, or less than one fetal cell per 100,000 maternal nucleated cells. Current techniques are able to isolate small quantities of fetal cells from the mother's blood, although it is very difficult to enrich the fetal cells to purity in any quantity. The most effective technique in this context involves the use of monoclonal antibodies, but other techniques used to isolate fetal cells include density centrifugation, selective lysis of adult erythrocytes, and FACS. Fetal DNA isolation has been demonstrated using PCR amplification using primers with fetal-specific DNA sequences. Since only tens of molecules of each embryonic SNP are available through these techniques, the genotyping of the fetal tissue with high fidelity is not currently possible.

Much research has been done towards the use of pre-implantation genetic diagnosis (PGD) as an alternative to classical prenatal diagnosis of inherited disease. Most PGD today focuses on high-level chromosomal abnormalities such as aneuploidy and balanced translocations, with the primary outcomes being successful implantation and a take-home baby. A need exists for a method for more extensive genotyping of embryos at the pre-implantation stage. The number of known disease-associated genetic alleles is currently at 389 according to OMIM and steadily climbing. Consequently, it is becoming increasingly relevant to analyze multiple embryonic SNPs that are associated with disease phenotypes. A clear advantage of pre-implantation genetic diagnosis over prenatal diagnosis is that it avoids some of the ethical issues regarding possible choices of action once undesirable phenotypes have been detected.

Many techniques exist for isolating single cells. The FACS machine has a variety of applications; one important application is to discriminate between cells based on size, shape and overall DNA content. The FACS machine can be set to sort single cells into any desired container. Many different groups have used single cell DNA analysis for a number of applications, including prenatal genetic diagnosis, recombination studies, and analysis of chromosomal imbalances. Single-sperm genotyping has been used previously for forensic analysis of sperm samples (to decrease problems arising from mixed samples) and for single-cell recombination studies.

Isolation of single cells from human embryos, while highly technical, is now routine in in vitro fertilization clinics. To date, the vast majority of prenatal diagnoses have used fluorescent in situ hybridization (FISH), which can determine large chromosomal aberrations (such as Down syndrome, or trisomy 21), and PCR/electrophoresis, which can determine a handful of SNPs or other allele calls. Both polar bodies and blastomeres have been isolated with success. It is critical to isolate single blastomeres without compromising embryonic integrity. The most common technique is to remove single blastomeres from day 3 embryos (6 or 8 cell stage). Embryos are transferred to a special cell culture medium (standard culture medium lacking calcium and magnesium), and a hole is introduced into the zona pellucida using an acidic solution, laser, or mechanical drilling. The technician then uses a biopsy pipette to remove a single blastomere with a visible nucleus. Clinical studies have demonstrated that this process does not decrease implantation success, since at this stage embryonic cells are undifferentiated.

There are three major methods available for whole genome amplification (WGA): ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), and multiple displacement amplification (MDA). In LM-PCR, short DNA sequences called adapters are ligated to blunt ends of DNA. These adapters contain universal amplification sequences, which are used to amplify the DNA by PCR. In DOP-PCR, random primers that also contain universal amplification sequences are used in a first round of annealing and PCR. Then, a second round of PCR is used to amplify the sequences further with the universal primer sequences. Finally, MDA uses the phi-29 polymerase, which is a highly processive and non-specific enzyme that replicates DNA and has been used for single-cell analysis. Of the three methods, DOP-PCR reliably produces large quantities of DNA from small quantities of DNA, including single copies of chromosomes. On the other hand, MDA is the fastest method, producing hundred-fold amplification of DNA in a few hours. The major limitations to amplifying material from a single cell are (1) the necessity of using extremely dilute DNA concentrations or an extremely small volume of reaction mixture, and (2) the difficulty of reliably dissociating DNA from proteins across the whole genome. Regardless, single-cell whole genome amplification has been used successfully for a variety of applications for a number of years.

There are numerous difficulties in using DNA amplification in these contexts. Amplification of single-cell DNA (or DNA from a small number of cells, or from smaller amounts of DNA) by PCR can fail completely, as reported in 5-10% of the cases. This is often due to contamination of the DNA, the loss of the cell or its DNA, or poor accessibility of the DNA during the PCR reaction. Other sources of error that may arise in measuring the embryonic DNA by amplification and microarray analysis include transcription errors introduced by the DNA polymerase, where a particular nucleotide is incorrectly copied during PCR, and microarray reading errors due to imperfect hybridization on the array. The biggest problem, however, remains allele drop-out (ADO), defined as the failure to amplify one of the two alleles in a heterozygous cell. ADO can affect more than 40% of amplifications and has already caused PGD misdiagnoses. ADO becomes a health issue especially in the case of a dominant disease, where the failure to amplify can lead to implantation of an affected embryo. The need for more than one set of primers for each marker (in heterozygotes) complicates the PCR process. Therefore, more reliable PCR assays are being developed based on understanding the origin of ADO. Reaction conditions for single-cell amplifications are under study. The amplicon size, the amount of DNA degradation, freezing and thawing, and the PCR program and conditions can each influence the rate of ADO.
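
As an illustration of why ADO matters for a dominant disease, the following minimal sketch estimates how often a heterozygous carrier cell would be read as homozygous normal. It is not part of the disclosed method. The function names are hypothetical, and the model assumes that when drop-out occurs either allele is equally likely to be the one lost; the 40% rate is the upper-end figure quoted above.

```python
import random


def missed_call_probability(ado_rate):
    """Chance that a heterozygous carrier is read as homozygous normal.

    Assumption (illustrative only): when drop-out occurs, either allele is
    equally likely to be the one that fails to amplify.
    """
    return ado_rate / 2.0


def simulate_missed_calls(ado_rate, n_cells, seed=0):
    """Monte Carlo estimate of the same quantity over n_cells heterozygous cells."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(n_cells):
        dropout = rng.random() < ado_rate      # one allele fails to amplify
        if dropout and rng.random() < 0.5:     # the disease allele is the one lost
            missed += 1
    return missed / n_cells


analytic = missed_call_probability(0.40)
simulated = simulate_missed_calls(0.40, n_cells=100_000)
print(f"analytic={analytic:.3f} simulated={simulated:.3f}")
```

Under these assumptions, a 40% ADO rate corresponds to roughly one affected heterozygous cell in five being miscalled at that locus from a single measurement.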

All those techniques, however, depend on the minute DNA amount available for amplification in the single cell. This process is often accompanied by contamination. Proper sterile conditions and microsatellite sizing can exclude contaminant DNA, since a microsatellite analysis that detects only parental alleles rules out contamination. Studies to reliably transfer molecular diagnostic protocols to the single-cell level have recently been pursued using first-round multiplex PCR of microsatellite markers, followed by real-time PCR and microsatellite sizing to exclude chance contamination. Multiplex PCR allows for the amplification of multiple fragments in a single reaction, a crucial requirement in single-cell DNA analysis. Although conventional PCR was the first method used in PGD, fluorescence in situ hybridization (FISH) is now common. It is a delicate visual assay that allows the detection of nucleic acid within undisturbed cellular and tissue architecture. It relies firstly on the fixation of the cells to be analyzed. Consequently, optimization of the fixation and storage conditions of the sample is needed, especially for single-cell suspensions.

Advanced technologies that enable the diagnosis of a number of diseases at the single-cell level include interphase chromosome conversion, comparative genomic hybridization (CGH), fluorescent PCR, and whole genome amplification. The reliability of the data generated by all of these techniques relies on the quality of the DNA preparation. PGD is also costly; consequently, there is a need for less expensive approaches, such as mini-sequencing. Unlike most mutation-detection techniques, mini-sequencing permits analysis of very small DNA fragments with a low ADO rate. Better methods for the preparation of single-cell DNA for amplification and PGD are therefore needed and are under study. The more novel microarrays and comparative genomic hybridization techniques still ultimately rely on the quality of the DNA under analysis.

Several techniques are in development to measure multiple SNPs on the DNA of a small number of cells, a single cell (for example, a blastomere), a small number of chromosomes, or from fragments of DNA. There are techniques that use Polymerase Chain Reaction (PCR), followed by microarray genotyping analysis. Some PCR-based techniques include whole genome amplification (WGA) techniques such as multiple displacement amplification (MDA), and Molecular Inversion Probes (MIPs) that perform genotyping using multiple tagged oligonucleotides that may then be amplified using PCR with a single pair of primers. An example of a non-PCR based technique is fluorescence in situ hybridization (FISH). It is apparent that the techniques will be severely error-prone due to the limited amount of genetic material, which will exacerbate the impact of effects such as allele drop-outs, imperfect hybridization, and contamination.

Many techniques exist which provide genotyping data. TaqMan is a unique genotyping technology produced and distributed by Applied Biosystems. TaqMan uses polymerase chain reaction (PCR) to amplify sequences of interest. During PCR cycling, an allele-specific minor groove binder (MGB) probe hybridizes to amplified sequences. Strand synthesis by the polymerase enzymes releases reporter dyes linked to the MGB probes, and then the TaqMan optical readers detect the dyes. In this manner, TaqMan achieves quantitative allelic discrimination. Compared with array-based genotyping technologies, TaqMan is quite expensive per reaction (~$0.40/reaction), and throughput is relatively low (384 genotypes per run). While only 1 ng of DNA per reaction is necessary, thousands of genotypes by TaqMan require microgram quantities of DNA, so TaqMan does not necessarily use less DNA than microarrays. However, with respect to the IVF genotyping workflow, TaqMan is the most readily applicable technology. This is due to the high reliability of the assays and, most importantly, the speed and ease of the assay (~3 hours per run and minimal molecular biological steps). Also, unlike many array technologies (such as 500K Affymetrix arrays), TaqMan is highly customizable, which is important for the IVF market. Further, TaqMan is highly quantitative, so aneuploidies could be detected with this technology alone.
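
The per-reaction figures quoted above can be turned into a rough budget for a genotyping panel. The sketch below is illustrative arithmetic only. It assumes one genotype per reaction, 1 ng of DNA per reaction, and runs performed one after another on a single instrument; the function name `taqman_budget` and the panel size of 5,000 genotypes are hypothetical.

```python
import math

# Figures quoted in the text; treated here as approximate constants.
COST_PER_REACTION_USD = 0.40
GENOTYPES_PER_RUN = 384
DNA_PER_REACTION_NG = 1.0   # assumption: 1 ng of DNA per reaction
HOURS_PER_RUN = 3.0


def taqman_budget(n_genotypes):
    """Rough cost, DNA, and time needed for n_genotypes single-genotype reactions."""
    runs = math.ceil(n_genotypes / GENOTYPES_PER_RUN)
    return {
        "runs": runs,
        "cost_usd": n_genotypes * COST_PER_REACTION_USD,
        "dna_ug": n_genotypes * DNA_PER_REACTION_NG / 1000.0,
        # assumption: runs are sequential on one instrument
        "hours_sequential": runs * HOURS_PER_RUN,
    }


budget = taqman_budget(5000)
print(budget)
```

For a panel of this size the DNA requirement lands in the microgram range, which is consistent with the observation that this technology does not necessarily use less DNA than microarrays.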

Illumina has recently emerged as a leader in high-throughput genotyping. Unlike Affymetrix, Illumina genotyping arrays do not rely exclusively on hybridization. Instead, Illumina technology uses an allele-specific DNA extension step, which is much more sensitive and specific than hybridization alone, for the original sequence detection. Then, all of these alleles are amplified in multiplex by PCR, and these products are hybridized to bead arrays. The beads on these arrays contain unique “address” tags, not native sequence, so this hybridization is highly specific and sensitive. Alleles are then called by quantitative scanning of the bead arrays. The Illumina Golden Gate assay system genotypes up to 1536 loci concurrently, so the throughput is better than TaqMan but not as high as Affymetrix 500K arrays. The cost of Illumina genotypes is lower than TaqMan, but higher than Affymetrix arrays. Also, the Illumina platform takes as long to complete as the 500K Affymetrix arrays (up to 72 hours), which is problematic for IVF genotyping. However, Illumina has a much better call rate, and the assay is quantitative, so aneuploidies are detectable with this technology. Illumina technology is much more flexible in choice of SNPs than 500K Affymetrix arrays.

One of the highest throughput techniques, which allows for the measurement of up to 250,000 SNPs at a time, is the Affymetrix GeneChip 500K genotyping array. This technique also uses PCR, followed by analysis by hybridization and detection of the amplified DNA sequences to DNA probes, chemically synthesized at different locations on a quartz surface. Disadvantages of these arrays are their low flexibility and lower sensitivity. There are modified approaches that can increase selectivity, such as the “perfect match” and “mismatch probe” approaches, but these do so at the cost of the number of SNP calls per array.

Pyrosequencing, or sequencing by synthesis, can also be used for genotyping and SNP analysis. The main advantages of pyrosequencing include an extremely fast turnaround and unambiguous SNP calls; however, the assay is not currently conducive to high-throughput parallel analysis. PCR followed by gel electrophoresis is an exceedingly simple technique that has met the most success in preimplantation diagnosis. In this technique, researchers use nested PCR to amplify short sequences of interest. Then, they run these DNA samples on a special gel to visualize the PCR products. Different bases have different molecular weights, so one can determine base content based on how fast the product runs in the gel. This technique is low-throughput and requires subjective analyses by scientists using current technologies, but has the advantage of speed (1-2 hours of PCR, 1 hour of gel electrophoresis). For this reason, it has been used previously for prenatal genotyping for a myriad of diseases, including: thalassaemia, neurofibromatosis type 2, leukocyte adhesion deficiency type I, Hallopeau-Siemens disease, sickle-cell anemia, retinoblastoma, Pelizaeus-Merzbacher disease, Duchenne muscular dystrophy, and Currarino syndrome.

Another promising technique that has been developed for genotyping small quantities of genetic material with very high fidelity is Molecular Inversion Probes (MIPs), such as Affymetrix's GenFlex Arrays. This technique has the capability to measure multiple SNPs in parallel: more than 10,000 SNPs measured in parallel have been verified. For small quantities of genetic material, call rates for this technique have been established at roughly 95%, and accuracy of the calls made has been established to be above 99%. So far, the technique has been implemented for quantities of genomic data as small as 150 molecules for a given SNP. However, the technique has not been verified for genomic data from a single cell, or a single strand of DNA, as would be required for pre-implantation genetic diagnosis.

The MIP technique makes use of padlock probes, which are linear oligonucleotides whose two ends can be joined by ligation when they hybridize to immediately adjacent target sequences of DNA. After the probes have hybridized to the genomic DNA, a gap-fill enzyme is added to the assay which can add one of the four nucleotides to the gap. If the added nucleotide (A, C, T, G) is complementary to the SNP under measurement, then it will hybridize to the DNA, and join the ends of the padlock probe by ligation. The circular products, or closed padlock probes, are then differentiated from linear probes by exonucleolysis. The exonuclease, by breaking down the linear probes and leaving the circular probes, will change the relative concentrations of the closed vs. the unclosed probes by a factor of 1000 or more. The probes that remain are then opened at a cleavage site by another enzyme, removed from the DNA, and amplified by PCR. Each probe is tagged with a different tag sequence consisting of 20 base tags (16,000 have been generated), and can be detected, for example, by the Affymetrix GenFlex Tag Array. The presence of the tagged probe from a reaction in which a particular gap-fill enzyme was added indicates the presence of the complementary nucleotide on the relevant SNP.
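
The call logic of the gap-fill step can be sketched as follows. This is a simplified model, not the assay's actual software: it assumes one reaction per candidate nucleotide, perfect ligation and exonuclease digestion, and that alleles are reported on the template strand. The names `closed_probe_reactions` and `call_genotype` are hypothetical.

```python
# Watson-Crick pairing used to relate the gap-fill nucleotide to the template base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def closed_probe_reactions(template_alleles):
    """Gap-fill nucleotides whose reaction yields circularized (closed) probes.

    template_alleles: iterable of bases present at the SNP on the template DNA
    that the padlock probe actually hybridized to.
    """
    return {COMPLEMENT[allele] for allele in template_alleles}


def call_genotype(reactions_with_signal):
    """Infer template alleles from the single-nucleotide reactions that gave signal."""
    return sorted(COMPLEMENT[nucleotide] for nucleotide in reactions_with_signal)


# Heterozygous A/G template: the T and C reactions close probes.
het_call = call_genotype(closed_probe_reactions("AG"))

# Same locus when the probe hybridizes only to the A-bearing strand (allele
# drop-out): only the T reaction gives signal, so the cell looks homozygous.
ado_call = call_genotype(closed_probe_reactions("A"))

print(het_call, ado_call)
```

The second case shows how a failure of the padlock probe to hybridize to one allele would surface as an apparent homozygous call.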

The molecular biological advantages of MIPs include: (1) multiplexed genotyping in a single reaction, (2) the genotype “call” occurs by gap fill and ligation, not hybridization, and (3) hybridization to an array of universal tags decreases false positives inherent to most array hybridizations. In traditional 500K, TaqMan and other genotyping arrays, the entire genomic sample is hybridized to the array, which contains a variety of perfect match and mismatch probes, and an algorithm calls likely genotypes based on the intensities of the mismatch and perfect match probes. Hybridization, however, is inherently noisy, because of the complexities of the DNA sample and the huge number of probes on the arrays. MIPs, on the other hand, use multiplex probes (i.e., not on an array) that are longer and therefore more specific, and then use a robust ligation step to circularize the probe. Background is exceedingly low in this assay (due to specificity), though allele dropout may be high (due to poor performing probes).

When this technique is used on genomic data from a single cell (or small numbers of cells) it will, like PCR based approaches, suffer from integrity issues. For example, the inability of the padlock probe to hybridize to the genomic DNA will cause allele dropouts. This will be exacerbated in the context of in-vitro fertilization since the efficiency of the hybridization reaction is low, and it needs to proceed relatively quickly in order to genotype the embryo in a limited time period. Note that the hybridization reaction time can be reduced well below vendor-recommended levels, and micro-fluidic techniques may also be used to accelerate the hybridization reaction. These approaches to reducing the time for the hybridization reaction will result in reduced data quality.

Once the genetic data has been measured, the next step is to use the data for predictive purposes. Much research has been done in predictive genomics, which tries to understand the precise functions of proteins, RNA and DNA so that phenotypic predictions can be made based on genotype. Canonical techniques focus on the function of Single-Nucleotide Polymorphisms (SNP); but more advanced methods are being brought to bear on multi-factorial phenotypic features. These methods include techniques, such as linear regression and nonlinear neural networks, which attempt to determine a mathematical relationship between a set of genetic and phenotypic predictors and a set of measured outcomes. There is also a set of regression analysis techniques, such as Ridge regression, log regression and stepwise selection, that are designed to accommodate sparse data sets where there are many potential predictors relative to the number of outcomes, as is typical of genetic data, and which apply additional constraints on the regression parameters so that a meaningful set of parameters can be resolved even when the data is underdetermined. Other techniques apply principal component analysis to extract information from underdetermined data sets. Other techniques, such as decision trees and contingency tables, use strategies for subdividing subjects based on their independent variables in order to place subjects in categories or bins for which the phenotypic outcomes are similar. A recent technique, termed logical regression, describes a method to search for different logical interrelationships between categorical independent variables in order to model a variable that depends on interactions between multiple independent variables related to genetic data. Regardless of the method used, the quality of the prediction is naturally highly dependent on the quality of the genetic data used to make the prediction.

Normal humans have two sets of 23 chromosomes in every diploid cell, with one copy coming from each parent. Aneuploidy, a cell with an extra or missing chromosome, and uniparental disomy, a cell with two of a given chromosome that originate from one parent, are believed to be responsible for a large percentage of failed implantations, miscarriages, and genetic diseases. When only certain cells in an individual are aneuploid, the individual is said to exhibit mosaicism. Detection of chromosomal abnormalities can identify individuals or embryos with conditions such as Down syndrome, Klinefelter's syndrome, and Turner syndrome, among others, in addition to increasing the chances of a successful pregnancy. Testing for chromosomal abnormalities is especially important as mothers age: between the ages of 35 and 40 it is estimated that between 40% and 50% of the embryos are abnormal, and above the age of 40, more than half of the embryos are abnormal.

Karyotyping, the traditional method used for the prediction of aneuploidies and mosaicism, is giving way to other more high throughput, more cost effective methods. One method that has attracted much attention recently is flow cytometry (FC) and fluorescence in situ hybridization (FISH), which can be used to detect aneuploidy in any phase of the cell cycle. One advantage of this method is that it is less expensive than karyotyping, but the cost is significant enough that generally a small selection of chromosomes are tested (usually chromosomes 13, 18, 21, X, Y; also sometimes 8, 9, 15, 16, 17, 22); in addition, FISH has a low level of specificity. Using FISH to analyze 15 cells, one can detect mosaicism of 19% with 95% confidence. The reliability of the test becomes much lower as the level of mosaicism gets lower, and as the number of cells to analyze decreases. The test is estimated to have a false negative rate as high as 15% when a single cell is analysed. There is a great demand for a method that has a higher throughput, lower cost, and greater accuracy.
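
One way to read the 19% figure, offered only as an illustrative sketch, is as a sampling calculation. Assuming that a fraction p of cells is aneuploid, that cells are sampled independently, and that every sampled aneuploid cell is recognized without error (all three are simplifying assumptions, not statements from the cited work), the probability that at least one of n analyzed cells is aneuploid is 1 − (1 − p)^n. The function names below are hypothetical.

```python
def detection_probability(mosaic_fraction, n_cells):
    """Probability that at least one of n_cells sampled cells is aneuploid.

    Assumes independent sampling and error-free scoring of each cell.
    """
    return 1.0 - (1.0 - mosaic_fraction) ** n_cells


def min_detectable_percent(n_cells, confidence=0.95):
    """Smallest whole-percent mosaicism level detected with the given confidence."""
    for pct in range(1, 101):
        if detection_probability(pct / 100, n_cells) >= confidence:
            return pct
    return None


if __name__ == "__main__":
    # With 15 cells, the idealized model reaches 95% confidence at 19% mosaicism.
    print(min_detectable_percent(15))
    # With fewer cells, the detectable level of mosaicism rises.
    print(min_detectable_percent(5))
```

Under these idealized assumptions the calculation is consistent with the statement that reliability falls as the level of mosaicism or the number of cells analyzed decreases; real assay error rates would make detection harder still.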

Listed here is a set of prior art which is related to the field of the current invention. None of this prior art contains or in any way refers to the novel elements of the current invention. In U.S. Pat. No. 6,720,140, Hartley et al. describe a recombinational cloning method for moving or exchanging segments of DNA molecules using engineered recombination sites and recombination proteins. In U.S. Pat. No. 6,489,135 Parrott et al. provide methods for determining various biological characteristics of in vitro fertilized embryos, including overall embryo health, implantability, and increased likelihood of developing successfully to term by analyzing media specimens of in vitro fertilization cultures for levels of bioactive lipids in order to determine these characteristics. In US Patent Application 20040033596 Threadgill et al. describe a method for preparing homozygous cellular libraries useful for in vitro phenotyping and gene mapping involving site-specific mitotic recombination in a plurality of isolated parent cells. In U.S. Pat. No. 5,994,148 Stewart et al. describe a method of determining the probability of an in vitro fertilization (IVF) being successful by measuring Relaxin directly in the serum or indirectly by culturing granulosa lutein cells extracted from the patient as part of an IVF/ET procedure. In US Patent application 5,635,366 Cooke et al. provide a method for predicting the outcome of IVF by determining the level of 11β-hydroxysteroid dehydrogenase (11β-HSD) in a biological sample from a female patient. In U.S. Pat. No. 7,058,616 Larder et al. describe a method for using a neural network to predict the resistance of a disease to a therapeutic agent. In U.S. Pat. No. 6,958,211 Vingerhoets et al. describe a method wherein the integrase genotype of a given HIV strain is simply compared to a known database of HIV integrase genotype with associated phenotypes to find a matching genotype. In U.S. Pat. No. 7,058,517 Denton et al. describe a method wherein an individual's haplotypes are compared to a known database of haplotypes in the general population to predict clinical response to a treatment. In U.S. Pat. No. 7,035,739 Schadt et al. describe a method wherein a genetic marker map is constructed and the individual genes and traits are analyzed to give a gene-trait locus data, which are then clustered as a way to identify genetically interacting pathways, which are validated using multivariate analysis. In U.S. Pat. No. 6,025,128 Veltri et al. describe a method involving the use of a neural network utilizing a collection of biomarkers as parameters to evaluate risk of prostate cancer recurrence.

The cost of DNA sequencing is dropping rapidly, and in the near future individual genomic sequencing for personal benefit will become more common. Knowledge of personal genetic data will allow for extensive phenotypic predictions to be made for the individual. In order to make accurate phenotypic predictions high quality genetic data is critical, whatever the context. In the case of prenatal or pre-implantation genetic diagnoses a complicating factor is the relative paucity of genetic material available. Given the inherently noisy nature of the measured genetic data in cases where limited genetic material is used for genotyping, there is a great need for a method which can increase the fidelity of, or clean, the primary data.

SUMMARY OF THE INVENTION

The system disclosed enables the cleaning of incomplete or noisy genetic data using secondary genetic data as a source of information, and also the determination of chromosome copy number using said genetic data. While the disclosure focuses on genetic data from human subjects, and more specifically on as-yet not implanted embryos or developing fetuses, as well as related individuals, it should be noted that the methods disclosed apply to the genetic data of a range of organisms, in a range of contexts. The techniques described for cleaning genetic data are most relevant in the context of pre-implantation diagnosis during in-vitro fertilization, prenatal diagnosis in conjunction with amniocentesis, chorion villus biopsy, fetal tissue sampling, and non-invasive prenatal diagnosis, where a small quantity of fetal genetic material is isolated from maternal blood. The use of this method may facilitate diagnoses focusing on inheritable diseases, chromosome copy number predictions, increased likelihoods of defects or abnormalities, as well as making predictions of susceptibility to various disease- and non-disease phenotypes for individuals to enhance clinical and lifestyle decisions. The invention addresses the shortcomings of prior art that are discussed above.

In one aspect of the invention, methods make use of knowledge of the genetic data of the mother and the father such as diploid tissue samples, sperm from the father, haploid samples from the mother or other embryos derived from the mother's and father's gametes, together with the knowledge of the mechanism of meiosis and the imperfect measurement of the embryonic DNA, in order to reconstruct, in silico, the embryonic DNA at the location of key loci with a high degree of confidence. In one aspect of the invention, genetic data derived from other related individuals, such as other embryos, brothers and sisters, grandparents or other relatives can also be used to increase the fidelity of the reconstructed embryonic DNA. It is important to note that the parental and other secondary genetic data allows the reconstruction not only of SNPs that were measured poorly, but also of insertions, deletions, and of SNPs or whole regions of DNA that were not measured at all.

In one aspect of the invention, the fetal or embryonic genomic data which has been reconstructed, with or without the use of genetic data from related individuals, can be used to detect if the cell is aneuploid, that is, where fewer or more than two copies of a particular chromosome are present in a cell. The reconstructed data can also be used to detect uniparental disomy, a condition in which two of a given chromosome are present, both of which originate from one parent. This is done by creating a set of hypotheses about the potential states of the DNA, and testing to see which hypothesis has the highest probability of being true given the measured data. Note that the use of high throughput genotyping data for screening for aneuploidy enables a single blastomere from each embryo to be used both to measure multiple disease-linked loci as well as to screen for aneuploidy.

In another aspect of the invention, the direct measurements of the amount of genetic material, amplified or unamplified, present at a plurality of loci, can be used to detect monosomy, uniparental disomy, trisomy and other aneuploidy states. The idea behind this method is that measuring the amount of genetic material at multiple loci will give a statistically significant result.

In another aspect of the invention, the measurements, direct or indirect, of a particular subset of SNPs, namely those loci where the parents are both homozygous but with different allele values, can be used to detect chromosomal abnormalities by looking at the ratios of maternally versus paternally miscalled homozygous loci on the embryo. The idea behind this method is that those loci where each parent is homozygous, but with different alleles, by definition result in a heterozygous locus on the embryo. Allele drop outs at those loci are random, and a shift in the ratio of loci miscalled as homozygous can only be due to incorrect chromosome number.
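
The ratio test described above may be sketched as follows. This passage does not name a particular statistical test; the exact two-sided binomial test, the function names, and the example counts below are assumptions made purely for illustration. Under the null hypothesis of one maternal and one paternal copy with random allele dropout, a locus miscalled as homozygous is taken to be equally likely to show the maternal or the paternal allele.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than k.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def parental_miscall_test(maternal_miscalls, paternal_miscalls):
    """p-value for a 50:50 split of miscalled homozygous loci.

    maternal_miscalls: loci (parents homozygous for different alleles) where
        the embryo was called homozygous for the maternal allele.
    paternal_miscalls: the same, for the paternal allele.
    """
    n = maternal_miscalls + paternal_miscalls
    if n == 0:
        raise ValueError("no miscalled loci to test")
    return binomial_two_sided_p(maternal_miscalls, n)


if __name__ == "__main__":
    # A balanced split is consistent with one copy from each parent.
    print(parental_miscall_test(5, 5))
    # A one-sided split is unlikely under random dropout alone.
    print(parental_miscall_test(0, 10))
```

A small p-value in this sketch only flags a skewed ratio; deciding which aneuploidy state explains the skew would require the hypothesis-testing machinery described elsewhere in this disclosure.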

It will be recognized by a person of ordinary skill in the art, given the benefit of this disclosure, that various aspects and embodiments of this disclosure may be implemented in combination or separately.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1: determining probability of false negatives and false positives for different hypotheses.

FIG. 2: the results from a mixed female sample, all loci hetero.

FIG. 3: the results from a mixed male sample, all loci hetero.

FIG. 4: Ct measurements for male sample differenced from Ct measurements for female sample.

FIG. 5: the results from a mixed female sample; Taqman single dye.

FIG. 6: the results from a mixed male; Taqman single dye.

FIG. 7: the distribution of repeated measurements for mixed male sample.

FIG. 8: the results from a mixed female sample; qPCR measures.

FIG. 9: the results from a mixed male sample; qPCR measures.

FIG. 10: Ct measurements for male sample differenced from Ct measurements for female sample.

FIG. 11: detecting aneuploidy with a third dissimilar chromosome.

FIGS. 12A and 12B: an illustration of two amplification distributions with constant allele dropout rate.

FIG. 13: a graph of the Gaussian probability density function of alpha.

FIG. 14: matched filter and performance for 19 loci measured with Taqman on 60 pg of DNA.

FIG. 15: matched filter and performance for 13 loci measured with Taqman on 6 pg of DNA.

FIG. 16: matched filter and performance for 20 loci measured with qPCR on 60 pg of DNA.

FIG. 17: matched filter and performance for 20 loci measured with qPCR on 6 pg of DNA.

FIG. 18A: matched filter and performance for 20 loci using MDA and Taqman on 16 single cells.

FIG. 18B: matched filter and performance for 11 loci on Chromosome 7 and 13 loci on Chromosome X using MDA and Taqman on 15 single cells.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Conceptual Overview of the System

The goal of the disclosed system is to provide highly accurate genomic data for the purpose of genetic diagnoses. In cases where the genetic data of an individual contains a significant amount of noise, or errors, the disclosed system makes use of the expected similarities between the genetic data of the target individual and the genetic data of related individuals, to clean the noise in the target genome. This is done by determining which segments of chromosomes of related individuals were involved in gamete formation and, when necessary, where crossovers may have occurred during meiosis, and therefore which segments of the genomes of related individuals are expected to be nearly identical to sections of the target genome. In certain situations this method can be used to clean noisy base pair measurements on the target individual, but it also can be used to infer the identity of individual base pairs or whole regions of DNA that were not measured. It can also be used to determine the number of copies of a given chromosome segment in the target individual. In addition, a confidence may be computed for each call made. A highly simplified explanation is presented first, making unrealistic assumptions in order to illustrate the concept of the invention. A detailed statistical approach that can be applied to the technology of today is presented afterward.

In one aspect of the invention, the target individual is an embryo, and the purpose of applying the disclosed method to the genetic data of the embryo is to allow a doctor or other agent to make an informed choice of which embryo(s) should be implanted during IVF. In another aspect of the invention, the target individual is a fetus, and the purpose of applying the disclosed method to genetic data of the fetus is to allow a doctor or other agent to make an informed choice about possible clinical decisions or other actions to be taken with respect to the fetus.

Definitions

SNP (Single Nucleotide Polymorphism): a single nucleotide that may differ between the genomes of two members of the same species. In our usage of the term, we do not set any limit on the frequency with which each variant occurs.

To call a SNP: to make a decision about the true state of a particular base pair, taking into account the direct and indirect evidence.

Locus: a particular region of interest on the DNA of an individual, which may refer to a SNP, the site of a possible insertion or deletion, or the site of some other relevant genetic variation. Disease-linked SNPs may also refer to disease-linked loci.

To call an allele: to determine the state of a particular locus of DNA. This may involve calling a SNP, or determining whether or not an insertion or deletion is present at that locus, or determining the number of insertions that may be present at that locus, or determining whether some other genetic variant is present at that locus.

Correct allele call: An allele call that correctly reflects the true state of the actual genetic material of an individual.

To clean genetic data: to take imperfect genetic data and correct some or all of the errors or fill in missing data at one or more loci. In the context of this disclosure, this involves using genetic data of related individuals and the method described herein.

To increase the fidelity of allele calls: to clean genetic data.

Imperfect genetic data: genetic data with any of the following: allele dropouts, uncertain base pair measurements, incorrect base pair measurements, missing base pair measurements, uncertain measurements of insertions or deletions, uncertain measurements of chromosome segment copy numbers, spurious signals, missing measurements, other errors, or combinations thereof.

Noisy genetic data: imperfect genetic data, also called incomplete genetic data.

Uncleaned genetic data: genetic data as measured, that is, where no method has been used to correct for the presence of noise or errors in the raw genetic data; also called crude genetic data.

Confidence: the statistical likelihood that the called SNP, allele, set of alleles, or determined number of chromosome segment copies correctly represents the real genetic state of the individual.

Parental Support (PS): a name sometimes used for any of the methods disclosed herein, where the genetic information of related individuals is used to determine the genetic state of target individuals. In some cases, it refers specifically to the allele calling method, sometimes to the method used for cleaning genetic data, sometimes to the method to determine the number of copies of a segment of a chromosome, and sometimes to some or all of these methods used in combination.

Copy Number Calling (CNC): the name given to the method described in this disclosure used to determine the number of chromosome segments in a cell.

Qualitative CNC (also qCNC): the name given to the method in this disclosure used to determine chromosome copy number in a cell that makes use of qualitative measured genetic data of the target individual and of related individuals.

Multigenic: affected by multiple genes, or alleles.

Direct relation: mother, father, son, or daughter.

Chromosomal Region: a segment of a chromosome, or a full chromosome.

Segment of a Chromosome: a section of a chromosome that can range in size from one base pair to the entire chromosome.

Section: a section of a chromosome. Section and segment can be used interchangeably.

Chromosome: may refer to either a full chromosome, or also a segment or section of a chromosome.

Copies: the number of copies of a chromosome segment may refer to identical copies, or it may refer to non-identical copies of a chromosome segment wherein the different copies of the chromosome segment contain a substantially similar set of loci, and where one or more of the alleles are different. Note that in some cases of aneuploidy, such as the M2 copy error, it is possible to have some copies of the given chromosome segment that are identical as well as some copies of the same chromosome segment that are not identical.

Haplotypic Data: also called ‘phased data’ or ‘ordered genetic data;’ data from a single chromosome in a diploid or polyploid genome, i.e., either the segregated maternal or paternal copy of a chromosome in a diploid genome.

Unordered Genetic Data: pooled data derived from measurements on two or more chromosomes in a diploid or polyploid genome, i.e., both the maternal and paternal copies of a chromosome in a diploid genome.

Genetic data ‘in’, ‘of’, ‘at’ or ‘on’ an individual: These phrases all refer to the data describing aspects of the genome of an individual. It may refer to one or a set of loci, partial or entire sequences, partial or entire chromosomes, or the entire genome.

Hypothesis: a set of possible copy numbers of a given set of chromosomes, or a set of possible genotypes at a given set of loci. The set of possibilities may contain one or more elements.

Target Individual: the individual whose genetic data is being determined. Typically, only a limited amount of DNA is available from the target individual. In one context, the target individual is an embryo or a fetus.

Related Individual: any individual who is genetically related, and thus shares haplotype blocks, with the target individual.

Platform response: a mathematical characterization of the input/output characteristics of a genetic measurement platform, such as TAQMAN or INFINIUM. The input to the channel is the true underlying genotypes of the genetic loci being measured. The channel output could be allele calls (qualitative) or raw numerical measurements (quantitative), depending on the context. For example, in the case in which the platform's raw numeric output is reduced to qualitative genotype calls, the platform response consists of an error transition matrix that describes the conditional probability of seeing a particular output genotype call given a particular true genotype input. In the case in which the platform's output is left as raw numeric measurements, the platform response is a conditional probability density function that describes the probability of the numeric outputs given a particular true genotype input.
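
As a purely illustrative sketch of the qualitative case, an error transition matrix can be represented as a table of conditional probabilities P(output call | true genotype), with each row summing to one. The model below is an assumption made for illustration only and is not the response of any actual platform: it supposes that allele dropout is the sole error source and that each allele copy drops out independently with the same probability. The names `error_transition_matrix`, `GENOTYPES`, and `NO_CALL` are hypothetical.

```python
NO_CALL = "XX"                     # hypothetical symbol for a locus dropout
GENOTYPES = ("AA", "AB", "BB")     # A and B are the two allele values


def error_transition_matrix(dropout):
    """Return {true_genotype: {output_call: probability}}.

    Assumed model: each of the two allele copies drops out independently
    with probability `dropout`; no other measurement error is modeled.
    """
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must be a probability")
    keep = 1.0 - dropout
    matrix = {}
    for true in GENOTYPES:
        row = {call: 0.0 for call in GENOTYPES + (NO_CALL,)}
        if true == "AB":
            row["AB"] = keep * keep        # both alleles observed
            row["AA"] = keep * dropout     # B allele dropped
            row["BB"] = dropout * keep     # A allele dropped
        else:
            row[true] = 1.0 - dropout * dropout  # at least one copy observed
        row[NO_CALL] = dropout * dropout   # both copies dropped
        matrix[true] = row
    return matrix


if __name__ == "__main__":
    for true, row in error_transition_matrix(0.2).items():
        print(true, row)
```

In practice such a matrix would be estimated empirically for the platform and sample type in question; the quantitative case would replace each row with a conditional probability density.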

Copy number hypothesis: a hypothesis about how many copies of a particular chromosome segment are in the embryo. In a preferred embodiment, this hypothesis consists of a set of sub-hypotheses about how many copies of this chromosome segment were contributed by each related individual to the target individual.

Technical Description of the System

A Allele Calling: Preferred Method

Assume here the goal is to estimate the genetic data of an embryo as accurately as possible, and where the estimate is derived from measurements taken from the embryo, father, and mother across the same set of n SNPs. Note that where this description refers to SNPs, it may also refer to a locus where any genetic variation, such as a point mutation, insertion or deletion may be present. This allele calling method is part of the Parental Support (PS) system. One way to increase the fidelity of allele calls in the genetic data of a target individual for the purposes of making clinically actionable predictions is described here. It should be obvious to one skilled in the art how to modify the method for use in contexts where the target individual is not an embryo, where genetic data from only one parent is available, where neither, one or both of the parental haplotypes are known, or where genetic data from other related individuals is known and can be incorporated.

For the purposes of this discussion, only consider SNPs that admit two allele values; without loss of generality it is possible to assume that the allele values on all SNPs belong to the alphabet A={A,C}. It is also assumed that the errors on the measurements of each of the SNPs are independent. This assumption is reasonable when the SNPs being measured are from sufficiently distant genic regions. Note that one could incorporate information about haplotype blocks or other techniques to model correlation between measurement errors on SNPs without changing the fundamental concepts of this invention.

Let e=(e₁,e₂) be the true, unknown, ordered SNP information on the embryo, e₁,e₂ϵA^(n). Define e₁ to be the genetic haploid information inherited from the father and e₂ to be the genetic haploid information inherited from the mother. Also use e_(i)=(e_(1i),e_(2i)) to denote the ordered pair of alleles at the i-th position of e. In similar fashion, let f=(f₁,f₂) and m=(m₁,m₂) be the true, unknown, ordered SNP information on the father and mother respectively. In addition, let g₁ be the true, unknown, haploid information on a single sperm from the father. (One can think of the letter g as standing for gamete. There is no g₂. The subscript is used to remind the reader that the information is haploid, in the same way that f₁ and f₂ are haploid.) It is also convenient to define r=(f,m), so that there is a symbol to represent the complete set of diploid parent information from which e inherits, and also write r_(i)=(f_(i),m_(i))=((f_(1i),f_(2i)),(m_(1i),m_(2i))) to denote the complete set of ordered information on father and mother at the i-th SNP. Finally, let ê=(ê₁, ê₂) be the estimate of e that is sought, ê₁, ê₂ϵA^(n).

By a crossover map, it is meant an n-tuple θϵ{1,2}^(n) that specifies how a haploid pair such as (f₁,f₂) recombines to form a gamete such as e₁. Treating θ as a function whose output is a haploid sequence, define θ(f)_(i)=θ(f₁,f₂)_(i)=f_(θ_(i),i). To make this idea more concrete, let f₁=ACAAACCC, let f₂=CAACCACA, and let θ=11111222. Then θ(f₁,f₂)=ACAAAACA. In this example, the crossover map θ implicitly indicates that a crossover occurred between SNPs i=5 and i=6.
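
The action of a crossover map can be sketched in a few lines of code. The function names `apply_crossover` and `crossover_points` are hypothetical and are used only to reproduce the worked example above, with haplotypes and the map θ written as strings.

```python
def apply_crossover(theta, h1, h2):
    """Return the gamete theta(h1, h2).

    At SNP i the output allele is taken from haplotype h1 if theta[i] == '1'
    and from haplotype h2 if theta[i] == '2'.
    """
    if not (len(theta) == len(h1) == len(h2)):
        raise ValueError("theta, h1 and h2 must have the same length")
    source = {"1": h1, "2": h2}
    return "".join(source[t][i] for i, t in enumerate(theta))


def crossover_points(theta):
    """Return 1-indexed SNP pairs (i, i+1) between which a crossover occurs."""
    return [(i, i + 1) for i in range(1, len(theta)) if theta[i] != theta[i - 1]]


if __name__ == "__main__":
    f1, f2, theta = "ACAAACCC", "CAACCACA", "11111222"
    print(apply_crossover(theta, f1, f2))  # ACAAAACA
    print(crossover_points(theta))         # [(5, 6)]
```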

Formally, let θ be the true, unknown crossover map that determines e₁ from f, let ϕ be the true, unknown crossover map that determines e₂ from m, and let ψ be the true, unknown crossover map that determines g₁ from f. That is, e₁=θ(f), e₂=ϕ(m), g₁=ψ(f). It is also convenient to define X=(θ,ϕ,ψ) so that there is a symbol to represent the complete set of crossover information associated with the problem. For simplicity's sake, write e=X(r) as shorthand for e=(θ(f),ϕ(m)); also write e_(i)=X(r_(i)) as shorthand for e_(i)=X(r)_(i).

In reality, when chromosomes combine, at most a few crossovers occur, making most of the 2^(n) theoretically possible crossover maps distinctly improbable. In practice, these very low probability crossover maps will be treated as though they had probability zero, considering only crossover maps belonging to a comparatively small set Ω. For example, if Ω is defined to be the set of crossover maps that derive from at most one crossover, then |Ω|=2n.
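
The count |Ω|=2n can be checked by direct enumeration: a map with at most one crossover is fixed by the haplotype it starts on (two choices) and by how many leading SNPs it takes from that haplotype before switching (n choices, the last of which means no crossover). The helper name `single_crossover_maps` is hypothetical.

```python
def single_crossover_maps(n):
    """Enumerate all crossover maps in {1, 2}^n with at most one crossover."""
    maps = []
    for start in (1, 2):
        other = 3 - start
        # The first k SNPs come from `start`, the remaining n - k from `other`;
        # k == n corresponds to no crossover at all.
        for k in range(1, n + 1):
            maps.append((start,) * k + (other,) * (n - k))
    return maps


if __name__ == "__main__":
    n = 8
    omega = single_crossover_maps(n)
    print(len(omega), "maps out of", 2 ** n, "possible")  # 16 maps out of 256 possible
```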

It is convenient to have an alphabet that can be used to describe unordered diploid measurements. To that end, let B={A,B,C,X}. Here A and C represent their respective homozygous locus states and B represents a heterozygous but unordered locus state. Note: this section is the only section of the document that uses the symbol B to stand for a heterozygous but unordered locus state. Most other sections of the document use the symbols A and B to stand for the two different allele values that can occur at a locus. X represents an unmeasured locus, i.e., a locus drop-out. To make this idea more concrete, let f₁=ACAAACCC, and let f₂=CAACCACA. Then a noiseless unordered diploid measurement of f would yield {tilde over (f)}=BBABBBCB.
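
The mapping from an ordered haplotype pair to a noiseless unordered measurement can be sketched as follows. The function name and the optional `dropped` argument (a set of 0-indexed loci to report as X) are hypothetical additions for illustration.

```python
def unordered_measurement(h1, h2, dropped=()):
    """Noiseless unordered diploid call over the alphabet {A, B, C, X}.

    A or C: homozygous for that allele; B: heterozygous; X: locus drop-out.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    calls = []
    for i, (a, b) in enumerate(zip(h1, h2)):
        if i in dropped:
            calls.append("X")
        else:
            calls.append(a if a == b else "B")
    return "".join(calls)


if __name__ == "__main__":
    print(unordered_measurement("ACAAACCC", "CAACCACA"))  # BBABBBCB
```

Because the call depends only on the unordered pair at each locus, swapping the two haplotypes leaves the measurement unchanged, which is the symmetry addressed later in this section.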

In the problem at hand, it is only possible to take unordered diploid measurements of e, f, and m, although there may be ordered haploid measurements on g₁. This results in noisy measured sequences that are denoted {tilde over (e)}ϵB^(n), {tilde over (f)}ϵB^(n), {tilde over (m)}ϵB^(n), and {tilde over (g)}₁ϵA^(n) respectively. It will be convenient to define {tilde over (r)}=({tilde over (f)}, {tilde over (m)}) so that there is a symbol that represents the noisy measurements on the parent data. It will also be convenient to define {tilde over (D)}=({tilde over (r)}, {tilde over (e)}, {tilde over (g)}₁) so that there is a symbol to represent the complete set of noisy measurements associated with the problem, and to write {tilde over (D)}_(i)=({tilde over (r)}_(i), {tilde over (e)}_(i), {tilde over (g)}_(1i))=({tilde over (f)}_(i), {tilde over (m)}_(i), {tilde over (e)}_(i), {tilde over (g)}_(1i)) to denote the complete set of measurements on the i-th SNP. (Please note that, while f_(i) is an ordered pair such as (A,C), {tilde over (f)}_(i) is a single letter such as B.)

Because the diploid measurements are unordered, nothing in the data can distinguish the state (f₁, f₂) from (f₂, f₁) or the state (m₁, m₂) from (m₂, m₁). These indistinguishable symmetric states give rise to multiple optimal solutions of the estimation problem. To eliminate the symmetries, and without loss of generality, assign θ₁=ϕ₁=1.

In summary, then, the problem is defined by a true but unknown underlying set of information {r, e, g₁, X}, with e=X(r). Only noisy measurements {tilde over (D)}=({tilde over (r)}, {tilde over (e)}, {tilde over (g)}₁) are available. The goal is to come up with an estimate ê of e, based on {tilde over (D)}.

Note that this method implicitly assumes euploidy on the embryo. It should be obvious to one skilled in the art how this method could be used in conjunction with the aneuploidy calling methods described elsewhere in this patent. For example, the aneuploidy calling method could be first employed to ensure that the embryo is indeed euploid and only then would the allele calling method be employed, or the aneuploidy calling method could be used to determine how many chromosome copies were derived from each parent and only then would the allele calling method be employed. It should also be obvious to one skilled in the art how this method could be modified in the case of a sex chromosome where there is only one copy of a chromosome present.

Solution Via Maximum a Posteriori Estimation

In one embodiment of the invention, it is possible, for each of the n SNP positions, to use a maximum a posteriori (MAP) estimation to determine the most probable ordered allele pair at that position. The derivation that follows uses a common shorthand notation for probability expressions. For example, P(e′_(i),{tilde over (D)}|X′) is written to denote the probability that random variable e_(i) takes on value e′_(i) and the random variable {tilde over (D)} takes on its observed value, conditional on the event that the random variable X takes on the value X′. Using MAP estimation, then, the i-th component of ê, denoted ê_(i)=(ê_(1i), ê_(2i)) is given by

$\begin{aligned}
\hat{e}_i &= \underset{e'_i}{\arg\max}\; P\left(e'_i \mid \tilde{D}\right) \\
&= \underset{e'_i}{\arg\max} \sum_{X' \in \Omega^3} P(X')\, P\left(e'_i, \tilde{D} \mid X'\right) \\
&\overset{(a)}{=} \underset{e'_i}{\arg\max} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X')\, P\left(e'_i, \tilde{D}_i \mid X'\right) \prod_{j \neq i} P\left(\tilde{D}_j \mid X'\right) \\
&\overset{(b)}{=} \underset{e'_i}{\arg\max} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X') \sum_{r'_i \in A^4} P(r'_i)\, P\left(e'_i, \tilde{D}_i \mid X', r'_i\right) \prod_{j \neq i} \sum_{r'_j \in A^4} P(r'_j)\, P\left(\tilde{D}_j \mid X', r'_j\right) \\
&\overset{(c)}{=} \underset{e'_i}{\arg\max} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X') \sum_{r'_i \in A^4} P(r'_i)\, P\left(e'_i \mid X', r'_i\right) P\left(\tilde{D}_i \mid X', r'_i\right) \prod_{j \neq i} \sum_{r'_j \in A^4} P(r'_j)\, P\left(\tilde{D}_j \mid X', r'_j\right) \\
&\overset{(*)}{=} \underset{e'_i}{\arg\max} \sum_{X' \in \Omega^3:\ \theta'_1 = \phi'_1 = 1} P(X') \prod_{j} \sum_{r'_j \in A^4} 1\left(i \neq j \ \text{or}\ X'(r'_j) = e'_i\right) P(r'_j)\, P\left(\tilde{D}_j \mid X', r'_j\right)
\end{aligned}$

In the preceding set of equations, (a) holds because the assumption of SNP independence means that all of the random variables associated with SNP i are conditionally independent of all of the random variables associated with SNP j, given X; (b) holds because r is independent of X; (c) holds because e_(i) and {tilde over (D)}_(i) are conditionally independent given r_(i) and X (in particular, e_(i)=X(r_(i))); and (*) holds, again, because e_(i)=X(r_(i)), which means that P(e′_(i)|X′,r′_(i)) evaluates to either one or zero and hence effectively filters r′_(i) to just those values that are consistent with e′_(i) and X′.

The final expression (*) above contains three probability expressions: P(X′), P(r′_(j)), and P({tilde over (D)}_(j)|X′,r′_(j)). The computation of each of these quantities is discussed in the following three sections.

Crossover Map Probabilities

Recent research has enabled the modeling of the probability of recombination between any two SNP loci. Observations from sperm studies and patterns of genetic variation show that recombination rates vary extensively on kilobase scales and that much recombination occurs in recombination hotspots. The NCBI data about recombination rates on the Human Genome is publicly available through the UCSC Genome Annotation Database.

One may use the data set from the Hapmap Project or the Perlegen Human Haplotype Project. The latter is higher density; the former is higher quality. These rates can be estimated using various techniques known to those skilled in the art, such as the reversible-jump Markov Chain Monte Carlo (MCMC) method that is available in the package LDHat.

In one embodiment of the invention, it is possible to calculate the probability of any crossover map given the probability of crossover between any two SNPs. For example, P(θ=11111222) is one half the probability that a crossover occurred between SNPs five and six. The reason it is only half the probability is that a particular crossover pattern has two crossover maps associated with it: one for each gamete. In this case, the other crossover map is θ=22222111.
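The calculation described above can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the function name `crossover_map_probability` and the argument `p_cross` are hypothetical, and the sketch assumes that crossovers in different SNP intervals occur independently, so that intervals without a switch contribute a factor of (1 - p). When all other interval probabilities are zero, the result reduces to one half the probability of the single crossover, as in the example above.

```python
def crossover_map_probability(theta, p_cross):
    """Probability of a crossover map such as "11111222".

    theta   : string of '1'/'2', the grandparental origin at each SNP.
    p_cross : p_cross[k] is the (assumed) probability of a crossover
              between SNP k and SNP k+1, using 0-based positions.
    Assumption: intervals are independent of one another.
    """
    if len(p_cross) != len(theta) - 1:
        raise ValueError("need one crossover probability per SNP interval")
    # Factor of one half: either gamete (starting haplotype) is equally likely.
    prob = 0.5
    for k in range(len(theta) - 1):
        switched = theta[k] != theta[k + 1]
        prob *= p_cross[k] if switched else 1.0 - p_cross[k]
    return prob
```

Under these assumptions, a map and its complement (for example 11111222 and 22222111) receive the same probability, and the probabilities of all 2^n maps sum to one.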

Recall that X=(θ,ϕ,ψ), where e₁=θ(f), e₂=ϕ(m), g₁=ψ(f). Obviously θ, ϕ, and ψ result from independent physical events, so P(X)=P(θ)P(ϕ)P(ψ). Further assume that P_(θ)(•)=P_(ϕ)(•)=P_(ψ)(•), where the actual distribution P_(θ)(•) is determined in the obvious way from the Hapmap data.

Allele Probabilities

It is possible to determine P(r_(i))=P(f_(i))P(m_(i))=P(f_(i1))P(f_(i2))P(m_(i1))P(m_(i2)) using population frequency information from databases such as dbSNP. Also, as mentioned previously, choose SNPs for which the assumption of intra-haploid independence is a reasonable one. That is, assume that

${P(r)} = {\prod\limits_{i}\; {P\left( r_{i} \right)}}$

Measurement Errors

Conditional on whether a locus is heterozygous or homozygous, measurement errors may be modeled as independent and identically distributed across all similarly typed loci. Thus:

$\begin{matrix}{{P\left( {\left. \overset{\sim}{D} \middle| X \right.,r} \right)} = {\prod\limits_{i}{P\left( {\left. {\overset{\sim}{D}}_{i} \middle| X \right.,r_{i}} \right)}}} \\{= {\prod\limits_{i}{P\left( {{\overset{\sim}{f}}_{i},{\overset{\sim}{m}}_{i},{\overset{\sim}{e}}_{i},\left. {\overset{\sim}{g}}_{1i} \middle| X \right.,f_{i},m_{i}} \right)}}} \\{= {\prod\limits_{i}{{P\left( {\overset{\sim}{f}}_{i} \middle| f_{i} \right)}{P\left( {\overset{\sim}{m}}_{i} \middle| m_{i} \right)}{P\left( {\left. {\overset{\sim}{e}}_{i} \middle| {\theta \left( f_{i} \right)} \right.,{\varphi \left( m_{i} \right)}} \right)}{P\left( {\overset{\sim}{g}}_{1i} \middle| {\Psi \left( f_{i} \right)} \right)}}}}\end{matrix}$

where each of the four conditional probability distributions in the final expression is determined empirically, and where the additional assumption is made that the first two distributions are identical. For example, for unordered diploid measurements on a blastomere, empirical values p_(d)=0.5 and p_(a)=0.02 are obtained, which lead to the conditional probability distribution for P({tilde over (e)}_(i)|e_(i)) shown in Table 1.

Note that the conditional probability distributions mentioned above, P({tilde over (f)}_(i)|f_(i)), P({tilde over (m)}_(i)|m_(i)), P({tilde over (e)}_(i)|e_(i)), can vary widely from experiment to experiment, depending on various factors in the lab such as variations in the quality of genetic samples, or variations in the efficiency of whole genome amplification, or small variations in protocols used. Therefore, in a preferred embodiment, these conditional probability distributions are estimated on a per-experiment basis. We focus in later sections of this disclosure on estimating P({tilde over (e)}_(i)|e_(i)), but it will be clear to one skilled in the art after reading this disclosure how similar techniques can be applied to estimating P({tilde over (f)}_(i)|f_(i)) and P({tilde over (m)}_(i)|m_(i)). The distributions can each be modeled as belonging to a parametric family of distributions whose particular parameter values vary from experiment to experiment. As one example among many, it is possible to implicitly model the conditional probability distribution P({tilde over (e)}_(i)|e_(i)) as being parameterized by an allele dropout parameter p_(d) and an allele dropin parameter p_(a). The values of these parameters might vary widely from experiment to experiment, and it is possible to use standard techniques such as maximum likelihood estimation, MAP estimation, or Bayesian inference, whose application is illustrated at various places in this document, to estimate the values that these parameters take on in any individual experiment. Regardless of the precise method one uses, the key is to find the set of parameter values that maximizes the joint probability of the parameters and the data, by considering all possible tuples of parameter values within a region of interest in the parameter space. As described elsewhere in the document, this approach can be implemented when one knows the chromosome copy number of the target genome, or when one doesn't know the copy number call but is exploring different hypotheses. In the latter case, one searches for the combination of parameters and hypotheses that best matches the data, as is described elsewhere in this disclosure.
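The search over tuples of parameter values described above can be sketched as an exhaustive grid search. The sketch below is illustrative only: the names `grid_search_mle` and `make_toy_log_likelihood` are hypothetical, and the toy likelihood is an assumed simplification (independent binomial counts of dropouts at heterozygous loci and drop-ins at homozygous loci), not the error model of Table 1. Any log-likelihood function of (p_d, p_a) derived from the actual measurement model could be supplied in its place.

```python
from itertools import product
from math import log


def grid_search_mle(log_likelihood, pd_grid, pa_grid):
    """Return the (p_d, p_a) pair on the grid with the highest log-likelihood.

    log_likelihood : callable taking (p_d, p_a) and returning a float.
    pd_grid, pa_grid : candidate values within the region of interest.
    """
    best = None
    for p_d, p_a in product(pd_grid, pa_grid):
        score = log_likelihood(p_d, p_a)
        if best is None or score > best[0]:
            best = (score, p_d, p_a)
    return best[1], best[2]


def make_toy_log_likelihood(n_het, k_dropout, n_hom, k_dropin):
    """Hypothetical stand-in model: binomial counts of dropouts and drop-ins.

    Grid values must lie strictly between 0 and 1 so the logs are defined.
    """
    def log_likelihood(p_d, p_a):
        return (k_dropout * log(p_d) + (n_het - k_dropout) * log(1.0 - p_d)
                + k_dropin * log(p_a) + (n_hom - k_dropin) * log(1.0 - p_a))
    return log_likelihood
```

A finer grid, or a local optimizer started from the best grid point, could be used when more resolution is needed; the grid form is shown because it mirrors the exhaustive search over a region of interest described in the text.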

Note that one can also determine the conditional probability distributions as a function of particular parameters derived from the measurements, such as the magnitude of quantitative genotyping measurements, in order to increase accuracy of the method. This would not change the fundamental concept of the invention.

It is also possible to use non-parametric methods to estimate the above conditional probability distributions on a per-experiment basis. Nearest neighbor methods, smoothing kernels, and similar non-parametric methods familiar to those skilled in the art are some possibilities. Although this disclosure focuses on parametric estimation methods, use of non-parametric methods to estimate these conditional probability distributions would not change the fundamental concept of the invention. The usual caveats apply: parametric methods may suffer from model bias, but have lower variance. Non-parametric methods tend to be unbiased, but will have higher variance.

Note that it should be obvious to one skilled in the art, after reading this disclosure, how one could use quantitative information instead of explicit allele calls, in order to apply the PS method to making reliable allele calls, and this would not change the essential concepts of the disclosure.

B Factoring the Allele Calling Equation

In a preferred embodiment of the invention, the algorithm for allele calling can be structured so that it can be executed in a more computationally efficient fashion. In this section the equations are re-derived for allele-calling via the MAP method, this time reformulating the equations so that they reflect such a computationally efficient method of calculating the result.

Notation

X*,Y*,Z*ϵ{A,C}^(nx2) are the true ordered values on the mother, father, and embryo respectively.

H*ϵ{A,C}^(nxh) are true values on h sperm samples.

B*ϵ{A,C}^(nxbx2) are true ordered values on b blastomeres.

D={x,y,z,B,H} is the set of unordered measurement data on father, mother, embryo, b blastomeres and h sperm samples. D_(i)={x_(i),y_(i),z_(i),H_(i),B_(i)} is the data set restricted to the i-th SNP.

rϵ{A,C}⁴ represents a candidate 4-tuple of ordered values on both the mother and father at a particular locus.

{circumflex over (Z)}_(i)ϵ{A,C}² is the estimated ordered embryo value at SNP i.

Q=(2+2b+h) is the effective number of haploid chromosomes being measured, excluding the parents. Any hypothesis about the parental origin of all measured data (excluding the parents themselves) requires that Q crossover maps be specified.

χϵ{1,2}^(nxQ) is a crossover map matrix, representing a hypothesis about the parental origin of all measured data, excluding the parents. Note that there are 2^(nQ) different crossover matrices. χ_(i) is the matrix restricted to the i-th row. Note that there are 2^(Q) vector values that the i-th row can take on, from the set χ_(i)ϵ{1,2}^(Q).

f(x; y, z) is a function of (x, y, z) that is being treated as a function of just x. The values behind the semi-colon are constants in the context in which the function is being evaluated.

PS Equation Factorization

$\begin{aligned}
\hat{Z}_{i} &= \underset{Z_{i}}{\operatorname{argmax}}\; P\left(Z_{i}, D\right) \\
&= \underset{Z_{i}}{\operatorname{argmax}} \sum_{\chi} P(\chi)\, P\left(Z_{i}, D \mid \chi\right) \\
&= \underset{Z_{i}}{\operatorname{argmax}} \sum_{\chi} P\left(\chi_{1}\right) P\left(\chi_{2} \mid \chi_{1}\right) \ldots P\left(\chi_{n} \mid \chi_{n-1}\right) \left(\sum_{r \in \{A,C\}^{4}} P(r)\, P\left(Z_{i}, D_{i} \mid \chi_{i}, r\right)\right) \prod_{j \neq i} \left(\sum_{r \in \{A,C\}^{4}} P(r)\, P\left(D_{j} \mid \chi_{j}, r\right)\right) \\
&= \underset{Z_{i}}{\operatorname{argmax}} \sum_{\chi} P\left(\chi_{1}\right) P\left(\chi_{2} \mid \chi_{1}\right) \ldots P\left(\chi_{n} \mid \chi_{n-1}\right) f_{1}\left(\chi_{i}; Z_{i}, D_{i}\right) \prod_{j \neq i} f_{2}\left(\chi_{j}; D_{j}\right) \\
&= \underset{Z_{i}}{\operatorname{argmax}} \sum_{\chi_{1} \in \{1,2\}^{Q}} \ldots \sum_{\chi_{n} \in \{1,2\}^{Q}} P\left(\chi_{1}\right) P\left(\chi_{2} \mid \chi_{1}\right) \ldots P\left(\chi_{n} \mid \chi_{n-1}\right) f_{1}\left(\chi_{i}; Z_{i}, D_{i}\right) \prod_{j \neq i} f_{2}\left(\chi_{j}; D_{j}\right) \\
&= \underset{Z_{i}}{\operatorname{argmax}} \sum_{\chi_{1} \in \{1,2\}^{Q}} P\left(\chi_{1}\right) f_{2}\left(\chi_{1}; D_{1}\right) \times \sum_{\chi_{2} \in \{1,2\}^{Q}} P\left(\chi_{2} \mid \chi_{1}\right) f_{2}\left(\chi_{2}; D_{2}\right) \times \ldots \\
&\qquad \ldots \times \sum_{\chi_{i} \in \{1,2\}^{Q}} P\left(\chi_{i} \mid \chi_{i-1}\right) f_{1}\left(\chi_{i}; Z_{i}, D_{i}\right) \times \ldots \times \sum_{\chi_{n} \in \{1,2\}^{Q}} P\left(\chi_{n} \mid \chi_{n-1}\right) f_{2}\left(\chi_{n}; D_{n}\right)
\end{aligned}$

The number of different crossover matrices χ is 2^(nQ). Thus, a brute-force application of the first line above is O(n2^(nQ)). By exploiting structure via the factorization of P(χ) and P(Z_(i),D|χ), and invoking the previous result, the final line gives an expression that can be computed in O(n2^(2Q)).
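The saving obtained from the factorization can be illustrated with a small sketch. The function names and the data layout here are assumptions made for illustration: each row χ_(k) is encoded as an integer state in range(S) with S=2^(Q), `prior[a]` stands for P(χ₁=a), `trans[k][a][b]` stands for the conditional probability of row k+2 being b given row k+1 being a (1-based rows), and `factors[k][a]` stands for f₁ at the SNP of interest (for one candidate Z_(i)) and f₂ at every other SNP. The first function enumerates all S^n crossover matrices; the second pushes each sum inward as in the final line of the derivation and costs on the order of nS² operations.

```python
from itertools import product


def brute_force_sum(prior, trans, factors):
    """Sum over every sequence of row states: exponential in n."""
    n, S = len(factors), len(prior)
    total = 0.0
    for path in product(range(S), repeat=n):
        term = prior[path[0]] * factors[0][path[0]]
        for k in range(1, n):
            term *= trans[k - 1][path[k - 1]][path[k]] * factors[k][path[k]]
        total += term
    return total


def factored_sum(prior, trans, factors):
    """Same quantity, computed by nesting the sums: O(n * S^2)."""
    n, S = len(factors), len(prior)
    # msg[a] holds the sum over all later rows, given the current row is a.
    msg = [1.0] * S
    for k in range(n - 1, 0, -1):
        weighted = [factors[k][b] * msg[b] for b in range(S)]
        msg = [sum(trans[k - 1][a][b] * weighted[b] for b in range(S))
               for a in range(S)]
    return sum(prior[a] * factors[0][a] * msg[a] for a in range(S))
```

To make an allele call under this sketch, `factored_sum` would be evaluated once per candidate value of Z_(i), with only the row of `factors` at SNP i changing between candidates, and the candidate with the largest sum would be selected.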

C Quantitative Detection of Aneuploidy

In one embodiment of the invention, aneuploidy can be detected using the quantitative data output from the PS method discussed in this patent. Disclosed herein are multiple methods that make use of the same concept; these methods are termed Copy Number Calling (CNC). The statement of the problem is to determine the copy number of each of 23 chromosome-types in a single cell. The cell is first pre-amplified using a technique such as whole genome amplification using the MDA method. Then the resulting genetic material is selectively amplified with a technique such as PCR at a set of n chosen SNPs at each of m=23 chromosome types.

This yields a data set [t_(ij)], i=1 . . . n, j=1 . . . m of regularized ct (ct, or CT, is the point during the cycle time of the amplification at which dye measurement exceeds a given threshold) values obtained at SNP i, chromosome j. A regularized ct value implies that, for a given (i,j), the pair of raw ct values on channels FAM and VIC (these are arbitrary channel names denoting different dyes) obtained at that locus are combined to yield a ct value that accurately reflects the ct value that would have been obtained had the locus been homozygous. Thus, rather than having two ct values per locus, there is just one regularized ct value per locus.

The goal is to determine the set {n_(j)} of copy numbers on each chromosome. If the cell is euploid, then n_(j)=2 for all j; one exception is the case of the male X chromosome. If n_(j)≠2 for at least one j, then the cell is aneuploid, excepting the case of male X.

Biochemical Model

The relationship between ct values and chromosomal copy number is modeled as follows: α_(ij)n_(j)Q2^(β_(ij)t_(ij))=Q_(T). In this expression, n_(j) is the copy number of chromosome j. Q is an abstract quantity representing a baseline amount of pre-amplified genetic material from which the actual amount of pre-amplified genetic material at SNP i, chromosome j can be calculated as α_(ij)n_(j)Q. α_(ij) is a preferential amplification factor that specifies how much more SNP i on chromosome j will be pre-amplified via MDA than SNP 1 on chromosome 1. By definition, the preferential amplification factors are relative to α₁₁=1.

β_(ij) is the doubling rate for SNP i chromosome j under PCR. t_(ij) is the ct value. Q_(T) is the amount of genetic material at which the ct value is determined. T is a symbol, not an index, and merely stands for threshold.

It is important to realize that α_(ij), β_(ij), and Q_(T) are constants of the model that do not change from experiment to experiment. By contrast, n_(j) and Q are variables that change from experiment to experiment. Q is the amount of material there would be at SNP 1 of chromosome 1, if chromosome 1 were monosomic.

The original equation above does not contain a noise term. This can be included by taking logarithms (base 2) and rewriting it as follows:

${{{\left( \quad \right.{*)}}\beta_{ij}t_{ij}} = {{\log \; \frac{Q_{T}}{\alpha_{{ij}\;}}} - {\log \; n_{j}}}},{{\log \; Q} + Z_{ij}}$

The above equation indicates that the ct value is corrupted by additive Gaussian noise Z_(ij). Let the variance of this noise term be σ_(ij)².

Maximum Likelihood (ML) Estimation of Copy Number

In one embodiment of the method, maximum likelihood estimation is used, with respect to the model described above, to determine n_(j). The parameter Q makes this difficult unless another constraint is added:

${\frac{1}{m}{\sum\limits_{j}{\log \; n_{j}}}} = 1$

This indicates that the average log (base 2) copy number is 1, that is, that the typical copy number is 2. With this additional constraint one can now solve the following ML problem:

$\begin{aligned}
\hat{Q},\hat{n}_{j} &= \underset{Q,n_{j}}{\operatorname{argmax}} \prod_{ij} f_{Z}\left(\log n_{j} + \log Q - \left(\log\frac{Q_{T}}{\alpha_{ij}} - \beta_{ij}t_{ij}\right)\right) \quad \text{s.t.} \quad \frac{1}{m}\sum_{j}\log n_{j} = 1 \\
&= \underset{Q,n_{j}}{\operatorname{argmin}} \sum_{ij} \frac{1}{\sigma_{ij}^{2}}\left(\log n_{j} + \log Q - \left(\log\frac{Q_{T}}{\alpha_{ij}} - \beta_{ij}t_{ij}\right)\right)^{2} \quad \text{s.t.} \quad \frac{1}{m}\sum_{j}\log n_{j} = 1
\end{aligned}$

The last line above is linear in the variables log n_(j) and log Q, and is a simple weighted least squares problem with an equality constraint. The solution can be obtained in closed form by forming the Lagrangian

${L\left( {{\log \; n_{j}},{\log \; Q}} \right)} = {{\sum\limits_{ij}{\frac{1}{\sigma_{ij}^{2}}\begin{pmatrix}{{\log \; n_{j}} + {\log \; Q} -} \\\left( {{\log \; \frac{Q_{T}}{\alpha_{ij}}} - {\beta_{ij}t_{ij}}} \right)\end{pmatrix}^{2}}} + {\lambda {\sum\limits_{j}{\log \; n_{j}}}}}$

and taking partial derivatives.

Solution when Noise Variance is Constant

To avoid unnecessarily complicating the exposition, set σ_(ij)²=1. This assumption will remain unless explicitly stated otherwise. (In the general case in which each σ_(ij)² is different, the solutions will be weighted averages instead of simple averages, or weighted least squares solutions instead of simple least squares solutions.) In that case, the above linear system has the solution:

${\log \; Q_{j}}\overset{\Delta}{=}{\frac{1}{n}{\sum\limits_{i}\left( {{\log \; \frac{Q_{T}}{\alpha_{ij}}} - {\beta_{ij}t_{ij}}} \right)}}$${\log \; Q} = {{\frac{1}{m}{\sum\limits_{j}{\log \; Q_{j}}}} - 1}$${\log \; n_{j}} = {{{\log \; Q_{j}} - {\log \; Q}} = {\log \; \frac{Q_{j}}{Q}}}$

The first equation can be interpreted as a log estimate of the quantityof chromosome j. The second equation can be interpreted as saying thatthe average of the Q_(j) is the average of a diploid quantity;subtracting one from its log gives the desired monosome quantity. Thethird equation can be interpreted as saying that the copy number is justthe ratio

$\frac{Q_{j}}{Q}.$

Note that n_(j) is a ‘double difference’, since it is a difference of Q-values, each of which is itself a difference of values.
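The closed-form solution for the constant-variance case can be sketched as follows. This is an illustrative sketch rather than a definitive implementation: the function name `estimate_copy_numbers` and its argument layout (nested lists indexed as `t[i][j]` for SNP i and chromosome j) are hypothetical, logarithms are taken base 2 so that a log copy number of 1 corresponds to two copies, and the model constants α_(ij), β_(ij) and log Q_(T) are assumed to be known from prior calibration.

```python
from math import log2


def estimate_copy_numbers(t, alpha, beta, log_q_t):
    """Closed-form estimate of log Q and the copy numbers n_j (unit noise variance).

    t, alpha, beta : n-by-m nested lists (SNP i, chromosome j).
    log_q_t        : log2 of the threshold quantity Q_T.
    Returns (log_q, [n_1, ..., n_m]).
    """
    n, m = len(t), len(t[0])
    # log Q_j: per-chromosome average of log(Q_T / alpha_ij) - beta_ij * t_ij.
    log_q_j = [
        sum(log_q_t - log2(alpha[i][j]) - beta[i][j] * t[i][j] for i in range(n)) / n
        for j in range(m)
    ]
    # log Q: average over chromosomes, minus one to move from diploid to monosome.
    log_q = sum(log_q_j) / m - 1.0
    # n_j = Q_j / Q.
    copy_numbers = [2.0 ** (lq - log_q) for lq in log_q_j]
    return log_q, copy_numbers
```

Because the constraint fixes the average log copy number at 1, the estimates are relative to the other chromosomes in the same cell; a cell in which every chromosome is gained or lost together would not be distinguished by this estimator.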

Simple Solution

The above equations also reveal the solution under simpler modeling assumptions: for example, when making the assumption α_(ij)=1 for all i and j and/or when making the assumption that β_(ij)=β for all i and j. In the simplest case, when both α_(ij)=1 and β_(ij)=β, the solution reduces to

$(**)\quad \log n_{j} = 1 + \beta\left(\frac{1}{mn}\sum_{ij}t_{ij} - \frac{1}{n}\sum_{i}t_{ij}\right)$

The Double Differencing Method

In one embodiment of the invention, it is possible to detect monosomy using double differencing. It should be obvious to one skilled in the art how to modify this method for detecting other aneuploidy states. Let {t_(ij)} be the regularized ct values obtained from MDA pre-amplification followed by PCR on the genetic sample. As always, t_(ij) is the ct value on the i-th SNP of the j-th chromosome. Denote by t_(j) the vector of ct values associated with the j-th chromosome. Make the following definitions:

$\tilde{t} \overset{\Delta}{=} \frac{1}{mn}\sum_{ij}t_{ij}$

$\tilde{t}_{j} \overset{\Delta}{=} t_{j} - \tilde{t}\,1$

Classify chromosome j as monosomic if and only if f^(T){tilde over (t)}_(j) is higher than a certain threshold value, where f is a vector that represents a monosomy signature. f is the matched filter, whose construction is described next.

The matched filter f is constructed as a double difference of values obtained from two controlled experiments. Begin with known quantities of euploid male genetic material and euploid female genetic material. Assume there are large quantities of this material, and pre-amplification can be omitted. On both the male and female material, use PCR to measure n SNPs on both the X chromosome (chromosome 23), and chromosome 7. Let {t_(ij)^(X)}, i=1 . . . n, jϵ{7, 23} denote the measurements on the female, and let {t_(ij)^(Y)} similarly denote the measurements on the male. Given this, it is possible to construct the matched filter f from the resulting data as follows:

$\tilde{t}_{7}^{X} \overset{\Delta}{=} \frac{1}{n}\sum_{i}t_{i,7}^{X}$

$\tilde{t}_{7}^{Y} \overset{\Delta}{=} \frac{1}{n}\sum_{i}t_{i,7}^{Y}$

$\Delta^{X} \overset{\Delta}{=} t_{23}^{X} - \tilde{t}_{7}^{X}\,1$

$\Delta^{Y} \overset{\Delta}{=} t_{23}^{Y} - \tilde{t}_{7}^{Y}\,1$

$f \overset{\Delta}{=} \Delta^{Y} - \Delta^{X}$

In the above equations, {tilde over (t)}₇^(X) and {tilde over (t)}₇^(Y) are scalars, while Δ^(X) and Δ^(Y) are vectors. Note that the superscripts X and Y are just symbolic labels, not indices, denoting female and male respectively. Do not confuse the superscript X with measurements on the X chromosome. The X chromosome measurements are the ones with subscript 23.
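The construction of the matched filter and its use in classification can be sketched as follows. The function names and list-based data layout are hypothetical and chosen only for illustration; the threshold is left as an input because this section does not specify its value.

```python
def matched_filter(t23_female, t7_female, t23_male, t7_male):
    """Build f = Delta^Y - Delta^X from control ct values.

    Each argument is a list of ct values over the same n SNPs, measured on
    chromosome 23 (X) or chromosome 7 of euploid female or male material.
    """
    mean7_female = sum(t7_female) / len(t7_female)
    mean7_male = sum(t7_male) / len(t7_male)
    delta_x = [v - mean7_female for v in t23_female]  # female, label X
    delta_y = [v - mean7_male for v in t23_male]      # male, label Y
    return [dy - dx for dy, dx in zip(delta_y, delta_x)]


def monosomy_score(f, t_j, grand_mean):
    """Inner product of f with the centered ct vector of chromosome j."""
    return sum(fi * (tij - grand_mean) for fi, tij in zip(f, t_j))


def is_monosomic(f, t_j, grand_mean, threshold):
    """Classify chromosome j as monosomic when the score exceeds the threshold."""
    return monosomy_score(f, t_j, grand_mean) > threshold
```

Here `grand_mean` plays the role of the average ct value over all SNPs and all chromosomes of the sample under test, so that `t_j` minus `grand_mean` corresponds to the centered vector defined above.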

The next step is to take noise into account and to see what remnants of noise survive in the construction of the matched filter f as well as in the construction of {tilde over (t)}_(j). In this section, consider the simplest possible modeling assumption: that β_(ij)=β for all i and j, and that α_(ij)=1 for all i and j. Under these assumptions, from (*) above: βt_(ij)=log Q_(T) - log n_(j) - log Q + Z_(ij), which can be rewritten as:

$t_{ij} = {{\frac{1}{\beta}\log \; Q_{T}} - {\frac{1}{\beta \;}\log \; n_{j}} - {\frac{1}{\beta}\log \; Q} + Z_{ij}}$

In that case, the i-th component of the matched filter f is given by:

$f_{i}\overset{\Delta}{=}{{\Delta_{i}^{Y} - \Delta_{i}^{X}} = {\left\{ {t_{i,23}^{Y} - {\overset{\sim}{t}}_{7}^{Y}} \right\} - \left\{ {\left( {t_{i,23}^{X} - {\overset{\sim}{t}}_{7}^{X}} \right\} = {\left\{ {\left( {{\frac{1}{\beta}\log \; Q_{T}} - {\frac{1}{\beta}\log \; n_{23}^{Y}} - {\frac{1}{\beta}\log \; Q^{Y}} + Z_{i,23}^{Y}} \right) - {\frac{1}{n}{\sum\limits_{i}\left( {{\frac{1}{\beta}\log \; Q_{T}} - {\frac{1}{\beta}\log \; n_{7}^{Y}} - {\frac{1}{\beta}\log \; Q^{Y}} + Z_{i,7}^{Y}} \right)}}} \right\} - \left\{ {\left( {{\frac{1}{\beta}\log \; Q_{T}} - {\frac{1}{\beta}\log \; n_{23}^{X}} - {\frac{1}{\beta}\log \; Q^{X}} + Z_{i,23}^{X}} \right) - {\frac{1}{n}{\sum\limits_{i}\left( {{\frac{1}{\beta}\log \; Q_{T}} - {\frac{1}{\beta}\log \; n_{7}^{X}} - {\frac{1}{\beta}\log \; Q^{X}} + Z_{i,7}^{X}} \right)}}} \right\} - \left\{ {\left( {\frac{1}{\beta} + Z_{i,23}^{Y}} \right) - {\frac{1}{n}{\sum\limits_{i}Z_{i,7}^{Y}}}} \right\} - \left\{ {Z_{i,23}^{X} - {\frac{1}{n}{\sum\limits_{i}Z_{i,7}^{X}}}} \right\}}} \right.}}$

Note that the above equations take advantage of the fact that all the copy number variables are known, for example, n₂₃^(Y)=1 and n₂₃^(X)=2.

Given that all the noise terms are zero mean, the ideal matched filter is (1/β)1. Further, since scaling the filter vector only rescales the threshold, the vector 1 can be used as the matched filter. This is equivalent to simply taking the average of the components of {tilde over (t)}_(j). In other words, the matched filter paradigm is not necessary if the underlying biochemistry follows the simple model. In addition, one may omit the noise terms above, which can only serve to lower the accuracy of the method. Accordingly, this gives:

${\overset{\sim}{t}}_{ij}\overset{\Delta}{=}{{t_{j} - \overset{\sim}{t}} = {{\left\{ {{\frac{1}{\beta}\log \; Q_{T}} - {\frac{1}{\beta}\log \; n_{i}} - {\frac{1}{\beta}\log \; Q} + Z_{ij}} \right\} - {\frac{1}{mn}{\sum\limits_{i,j}\left\{ {{\frac{1}{\beta}\log \; Q_{T}} - {\frac{1}{\beta}\log \; n_{i}} - {\frac{1}{\beta}\log \; Q} + Z_{ij}} \right\}}}} = {{\frac{1}{\beta}\left( {1 - {\log \; n_{j}}} \right)} + Z_{ij} - {\frac{1}{mn}{\sum\limits_{i,j}Z_{ij}}}}}}$

In the above, it is assumed that

${\frac{1}{mn}{\sum\limits_{i,j}{\log \; n_{j}}}} = 1.$

that is, that the average log copy number is 1, corresponding to a typical copy number of 2.

Each element of the vector {tilde over (t)}_(j) is an independent measurement of the log copy number (scaled by 1/β), and then corrupted by noise. The noise term Z_(ij) cannot be gotten rid of: it is inherent in the measurement. The second noise term probably cannot be gotten rid of either, since subtracting out {tilde over (t)} is necessary to remove the nuisance term

$\frac{1}{\beta}\log \; {Q.}$

Again, note that, given the observation that each element of {tilde over (t)}_(j) is an independent measurement of

${\frac{1}{\beta}\left( {1 - {\log \; n_{j}}} \right)},$

it is clear that a UMVU (uniformly minimum variance unbiased) estimate of

$\frac{1}{\beta}\left( {1 - {\log \; n_{j}}} \right)$

is just the average of the elements of {tilde over (t)}_(j). (In the case in which each σ_(ij)² is different, it will be a weighted average.) Thus, performing a little bit of algebra, the UMVU estimator for log n_(j) is given by:

$\frac{1}{n}\sum_{i}\tilde{t}_{ij} \approx \frac{1}{\beta}\left(1 - \log n_{j}\right) \;\Rightarrow\; \log n_{j} \approx 1 - \beta \cdot \frac{1}{n}\sum_{i}\tilde{t}_{ij} = 1 - \beta\left(\frac{1}{n}\sum_{i}t_{ij} - \frac{1}{mn}\sum_{i,j}t_{ij}\right)$

Analysis Under the Complicated Model

Now repeat the preceding analysis with respect to a biochemical model in which each β_(ij) and α_(ij) is different. Again, take noise into account and see what remnants of noise survive in the construction of the matched filter f as well as in the construction of {tilde over (t)}_(j). Under the complicated model, from (*) above:

${\beta_{ij}t_{ij}} = {{\log \; \frac{Q_{T}}{\alpha_{ij}}} - {\log \; n_{j}} - {\log \; Q} + Z_{ij}}$

Which can be rewritten as:

$(***)\quad t_{ij} = \frac{1}{\beta_{ij}}\log\frac{Q_{T}}{\alpha_{ij}} - \frac{1}{\beta_{ij}}\log n_{j} - \frac{1}{\beta_{ij}}\log Q + Z_{ij}$

The i-th component of the matched filter f is given by:

$f_{i}\overset{\Delta}{=}{\Delta_{i}^{Y} - \Delta_{i}^{X} - \left\{ {t_{i,23}^{Y} - {\overset{\sim}{t}}_{7}^{Y}} \right\} - \left\{ {\left( {t_{i,23}^{X} - {\overset{\sim}{t}}_{7}^{X}} \right\} = {{\left\{ {\left( {{\frac{1}{\beta_{i,23}}\log \; \frac{Q_{T}}{\alpha_{i,23}}} - {\frac{1}{\beta_{i,23}}\log \; n_{23}^{Y}} - {\frac{1}{\beta_{i,23}}\log \; Q^{Y}} + Z_{i,23}^{Y}} \right) - {\frac{1}{n}{\sum\limits_{i}\left( {{\frac{1}{\beta_{i,7}}\log \; \frac{Q_{T}}{\alpha_{i,7}}} - {\frac{1}{\beta_{i,7}}\log \; n_{7}^{Y}} - {\frac{1}{\beta_{i,7}}\log \; Q^{Y}} + Z_{i,7}^{Y}} \right)}}} \right\} - \left\{ {\left( {{\frac{1}{\beta_{i,23}}\log \; \frac{Q_{T}}{\alpha_{i,23}}} - {\frac{1}{\beta_{i,23}}\log \; n_{23}^{X}} - {\frac{1}{\beta_{i,23}}\log \; Q^{X}} + Z_{i,23}^{X}} \right) - {\frac{1}{n}{\sum\limits_{i}\left( {{\frac{1}{\beta_{i,7}}\log \; \frac{Q_{T}}{\alpha_{i,7}}} - {\frac{1}{\beta_{i,7}}\log \; n_{7}^{X}} - {\frac{1}{\beta_{i,7}}\log \; Q^{X}} + Z_{i,7}^{X}} \right)}}} \right\}} = {\frac{1}{\beta_{i,23}} + {\left( {\frac{1}{\beta_{i,23}} - \left( {\frac{1}{n}{\sum\limits_{i}\frac{1}{\beta_{i,7}}}} \right)} \right)\log \; \frac{Q^{Y}}{Q^{X}}} + \left\{ {Z_{i,23}^{Y} - Z_{i,23}^{X} + {\frac{1}{n\;}{\sum\limits_{i}Z_{i,7}^{X}}} - {\frac{1}{n}{\sum\limits_{i}Z_{i,7}^{Y}}}} \right\}}}} \right.}$

Under the complicated model, this gives:

${\overset{\sim}{t}}_{ij}\overset{\Delta}{=}{{t_{j} - \overset{\sim}{t}} = {\left\{ {{\frac{1}{B_{ij}}\log \; \frac{Q_{T}}{\alpha_{ij}}} - {\frac{1}{\beta_{ij}}\log \; n_{j}} - {\frac{1}{\beta_{ij}}\log \; Q} + Z_{ij}} \right\} - {\frac{1}{mn}{\sum\limits_{i,j}\left\{ {{\frac{1}{\beta_{ij}}\log \; \frac{Q_{T}}{\alpha_{ij}}} - {\frac{1}{\beta_{ij}}\log \; n_{j}} - {\frac{1}{\beta_{ij}}\log \; Q} + Z_{ij}} \right\}}}}}$

An Alternate Way to Regularize CT Values

In another embodiment of the method, one can average the CT values rather than transforming to exponential scale and then taking logs, as this distorts the noise so that it is no longer zero mean. First, start with known Q and solve for betas. Then do multiple experiments with known n_(j) to solve for alphas. Since aneuploidy is a whole set of hypotheses, it is convenient to use ML to determine the most likely n_(j) and Q values, and then use this as a basis for calculating the most likely aneuploid state, e.g., by taking the n_(j) value that is most off from 1 and pushing it to its nearest aneuploid neighbor.

Estimation of the Error Rates in the Embryonic Measurements.

In one embodiment of the invention, it is possible to determine the conditional probabilities of particular embryonic measurements given specific underlying true states in embryonic DNA. In certain contexts, the given data consists of (i) the data about the parental SNP states, measured with a high degree of accuracy, and (ii) measurements on all of the SNPs in a specific blastomere, measured poorly.

Use the following notation: U is any specific homozygote, Ū is the other homozygote at that SNP, H is the heterozygote. The goal is to determine the probabilities (p_(ij)) shown in Table 2. For instance p₁₁ is the probability of the embryonic DNA being U and the readout being U as well. There are three conditions that these probabilities have to satisfy:

p ₁₁ +p ₁₂ +p ₁₃ +p ₁₄=1  (1)

p ₂₁ +p ₂₂ +p ₂₃ +p ₂₄=1  (2)

p ₂₁ =p ₂₃  (3)

The first two are obvious, and the third is the statement of symmetry of heterozygote dropouts (H should give the same dropout rate on average to either U or Ū).

There are 4 possible types of matings: U×U, U×Ū, U×H, H×H. Split all of the SNPs into these 4 categories depending on the specific mating type. Table 3 shows the matings, expected embryonic states, and then probabilities of specific readings (p_(ij)). Note that the first two rows of this table are the same as the two rows of Table 2 and the notation (p_(ij)) remains the same as in Table 2.

Probabilities p_(3i) and p_(4i) can be written out in terms of p_(1i) and p_(2i).

p ₃₁=½[p ₁₁ +p ₂₁]  (4)

p ₃₂=½[p ₁₂ +p ₂₂]  (5)

p ₃₃=½[p ₁₃ +p ₂₃]  (6)

p ₃₄=½[p ₁₄ +p ₂₄]  (7)

p ₄₁=¼[p ₁₁+2p ₂₁ +p ₁₃]  (8)

p ₄₂=½[p ₁₂ +p ₂₂]  (9)

p ₄₃=¼[p ₁₁+2p ₂₃ +p ₁₃]  (10)

p ₄₄=½[p ₁₄ +p ₂₄]  (11)

These can be thought of as a set of 8 linear constraints to add to the constraints (1), (2), and (3) listed above. If a vector P=[p₁₁, p₁₂, p₁₃, p₁₄, p₂₁ . . . , p₄₄]^(T) (16×1 dimension) is defined, then the matrix A (11×16) and a vector C can be defined such that the constraints can be represented as:

AP=C  (12)

C=[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]^(T) (11×1 dimension, one entry per constraint). Specifically, A is shown in Table 4, where empty cells have zeroes.
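One way to see how the constraints fit together is to note that once the first two rows of the table are fixed, equations (4) through (11) determine the remaining two rows. The sketch below illustrates this; the function names are hypothetical, and the column order is assumed to match that of equations (1) through (11). It builds the full 4×4 table of p_(ij) from rows 1 and 2 and checks constraints (1), (2) and (3).

```python
def build_reading_table(row1, row2):
    """Return the 4x4 table [p_ij] implied by rows 1 and 2 via equations (4)-(11).

    row1 : (p11, p12, p13, p14), readings for the U x U mating.
    row2 : (p21, p22, p23, p24), readings for the U x U-bar mating.
    """
    p11, p12, p13, p14 = row1
    p21, p22, p23, p24 = row2
    # Equations (4)-(7): the U x H mating is an equal mix of rows 1 and 2.
    row3 = [0.5 * (a + b) for a, b in zip(row1, row2)]
    # Equations (8)-(11): the H x H mating.
    row4 = [
        0.25 * (p11 + 2.0 * p21 + p13),
        0.5 * (p12 + p22),
        0.25 * (p11 + 2.0 * p23 + p13),
        0.5 * (p14 + p24),
    ]
    return [list(row1), list(row2), row3, row4]


def satisfies_base_constraints(row1, row2, tol=1e-9):
    """Check constraints (1), (2) and (3) on the two free rows."""
    return (abs(sum(row1) - 1.0) < tol
            and abs(sum(row2) - 1.0) < tol
            and abs(row2[0] - row2[2]) < tol)
```

When constraints (1) through (3) hold, rows 3 and 4 produced this way also sum to one, so a search over P can be restricted to the free entries of rows 1 and 2 rather than over all 16 entries subject to AP=C.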

The problem can now be framed as that of finding P that would maximize the likelihood of the observations and that is subject to a set of linear constraints (AP=C). The observations come in the same 16 types as p_(ij). These are shown in Table 5. The likelihood of making a set of these 16 n_(ij) observations is defined by a multinomial distribution with the probabilities p_(ij) and is proportional to:

$\begin{matrix}{{L\left( {P,n_{ij}} \right)} \propto {\prod\limits_{ij}p_{ij}^{n_{ij}}}} & (13)\end{matrix}$

Note that the full likelihood function contains multinomial coefficients that are not written out given that these coefficients do not depend on P and thus do not change the values within P at which L is maximized. The problem is then to find:

$\begin{matrix}{{\max\limits_{P}\left\lbrack {L\left( {P,n_{ij}} \right)} \right\rbrack} = {{\max\limits_{P}\left\lbrack {\ln \left( {L\left( {P,n_{ij}} \right)} \right)} \right\rbrack} = {\max\limits_{P}\left( {\sum\limits_{ij}{n_{ij}{\ln \left( p_{ij} \right)}}} \right)}}} & (14)\end{matrix}$

subject to the constraints AP=C. Note that in (14) taking the ln of L makes the problem more tractable (to deal with a sum instead of products). This is standard given that the value of x such that f(x) is maximized is the same for which ln(f(x)) is maximized.

D MAP Detection of Aneuploidy without Parents

In one embodiment of the invention, the PS method can be applied to determine the number of copies of a given chromosome segment in a target without using parental genetic information. In this section, a maximum a-posteriori (MAP) method is described that enables the classification of genetic allele information as aneuploid or euploid. The method does not require parental data, though when parental data are available the classification power is enhanced. The method does not require regularization of channel values. One way to determine the number of copies of a chromosome segment in the genome of a target individual by incorporating the genetic data of the target individual and related individual(s) into a hypothesis, and calculating the most likely hypothesis is described here. In this description, the method will be applied to ct values from TAQMAN measurements; it should be obvious to one skilled in the art how to apply this method to any kind of measurement from any platform. The description will focus on the case in which there are measurements on just chromosomes X and 7; again, it should be obvious to one skilled in the art how to apply the method to any number of chromosomes and sections of chromosomes.

Setup of the Problem

The given measurements are from triploid blastomeres, on chromosomes X and 7, and the goal is to successfully make aneuploidy calls on these. The only “truth” known about these blastomeres is that there must be three copies of chromosome 7. The number of copies of chromosome X is not known.

The strategy here is to use MAP estimation to classify the copy number N₇ of chromosome 7 from among the choices {1,2,3} given the measurements D. Formally that looks like this:

${\hat{n}}_{7} = {\underset{n_{7} \in {\{{1,2,3}\}}}{argmax}{P\left( {n_{7},D} \right)}}$

Unfortunately, it is not possible to calculate this probability, because the probability depends on the unknown quantity Q. If the distribution f on Q were known, then it would be possible to solve the following:

${\hat{n}}_{7} = {\underset{n_{7} \in {\{{1,2,3}\}}}{argmax}{\int{{f(Q)}{P\left( {n_{7},\left. D \middle| Q \right.} \right)}{dQ}}}}$

In practice, a continuous distribution on Q is not known. However, identifying Q to within a power of two is sufficient, and in practice a probability mass function (pmf) on Q that is uniform on say {2¹, 2² . . . , 2⁴⁰} can be used. In the development that follows, the integral sign will be used as though a probability distribution function (pdf) on Q were known, even though in practice a uniform pmf on a handful of exponential values of Q will be substituted.

This discussion will use the following notation and definitions:

N₇ is the copy number of chromosome seven. It is a random variable. n₇ denotes a potential value for N₇.

N_(X) is the copy number of chromosome X. n_(X) denotes a potential value for N_(X).

N_(j) is the copy number of chromosome-j, where for the purposes here jϵ{7,X}. n_(j) denotes a potential value for N_(j).

D is the set of all measurements. In one case, these are TAQMAN measurements on chromosomes X and 7, so this gives D={D₇,D_(X)}, where D_(j)={t_(ij)^(A),t_(ij)^(C)} is the set of TAQMAN measurements on this chromosome.

t_(ij)^(A) is the ct value on channel-A of locus i of chromosome-j. Similarly, t_(ij)^(C) is the ct value on channel-C of locus i of chromosome-j. (A is just a logical name and denotes the major allele value at the locus, while C denotes the minor allele value at the locus.)

Q represents a unit-amount of genetic material such that, if the copy number of chromosome-j is n_(j), then the total amount of genetic material at any locus of chromosome-j is n_(j)Q. For example, under trisomy, if a locus were AAC, then the amount of A-material at this locus would be 2Q, the amount of C-material at this locus would be Q, and the total combined amount of genetic material at this locus would be 3Q.

(n^(A),n^(C)) denotes an unordered allele pattern at a locus when the copy number for the associated chromosome is n. n^(A) is the number of times allele A appears on the locus and n^(C) is the number of times allele C appears on the locus. Each can take on values in 0, . . . , n, and it must be the case that n^(A)+n^(C)=n. For example, under trisomy, the set of allele patterns is {(0,3), (1,2), (2,1), (3,0)}. The allele pattern (2,1) for example corresponds to a locus value of A²C, i.e., that two chromosomes have allele value A and the third has an allele value of C at the locus. Under disomy, the set of allele patterns is {(0,2), (1,1), (2,0)}. Under monosomy, the set of allele patterns is {(0,1), (1,0)}.

Q_(T) is the (known) threshold value from the fundamental TAQMAN equation Q₀2^(βt)=Q_(T).

β is the (known) doubling-rate from the fundamental TAQMAN equation Q₀2^(βt)=Q_(T).

⊥ (pronounced “bottom”) is the ct value that is interpreted as meaning “no signal”.

f_(Z)(x) is the standard normal Gaussian pdf evaluated at x.

σ is the (known) standard deviation of the noise on TAQMAN ct values.

MAP Solution

In the solution below, the following assumptions have been made:

N₇ and N_(X) are independent.

Allele values on neighboring loci are independent.

The goal is to classify the copy number of a designated chromosome. In this case, the description will focus on chromosome 7. The MAP solution is given by:

$\begin{aligned}
{\hat{n}}_{7} &= \underset{n_{7} \in \{1,2,3\}}{argmax}\int f(Q)\,P\left(n_{7},D \mid Q\right)dQ \\
&= \underset{n_{7} \in \{1,2,3\}}{argmax}\int f(Q)\sum\limits_{n_{X} \in \{1,2,3\}} P\left(n_{7},n_{X},D \mid Q\right)dQ \\
&= \underset{n_{7} \in \{1,2,3\}}{argmax}\int f(Q)\sum\limits_{n_{X} \in \{1,2,3\}} P\left(n_{7}\right)P\left(n_{X}\right)P\left(D_{7} \mid Q,n_{7}\right)P\left(D_{X} \mid Q,n_{X}\right)dQ \\
&= \underset{n_{7} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{7}\right)P\left(D_{7} \mid Q,n_{7}\right)\right)\left(\sum\limits_{n_{X} \in \{1,2,3\}} P\left(n_{X}\right)P\left(D_{X} \mid Q,n_{X}\right)\right)dQ \\
&= \underset{n_{7} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{7}\right)\prod\limits_{i} P\left(t_{i,7}^{A},t_{i,7}^{C} \mid Q,n_{7}\right)\right)\left(\sum\limits_{n_{X} \in \{1,2,3\}} P\left(n_{X}\right)\prod\limits_{i} P\left(t_{i,X}^{A},t_{i,X}^{C} \mid Q,n_{X}\right)\right)dQ \qquad (*) \\
&= \underset{n_{7} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{7}\right)\prod\limits_{i}\sum\limits_{n^{A} + n^{C} = n_{7}} P\left(n^{A},n^{C} \mid n_{7},i\right)P\left(t_{i,7}^{A} \mid Q,n^{A}\right)P\left(t_{i,7}^{C} \mid Q,n^{C}\right)\right) \\
&\qquad \times \left(\sum\limits_{n_{X} \in \{1,2,3\}} P\left(n_{X}\right)\prod\limits_{i}\sum\limits_{n^{A} + n^{C} = n_{X}} P\left(n^{A},n^{C} \mid n_{X},i\right)P\left(t_{i,X}^{A} \mid Q,n^{A}\right)P\left(t_{i,X}^{C} \mid Q,n^{C}\right)\right)dQ
\end{aligned}$

Allele Distribution Model

Equation (*) depends on being able to calculate values for P(n^(A),n^(C)|n₇,i) and P(n^(A),n^(C)|n_(X),i). These values may be calculated by assuming that the allele pattern (n^(A),n^(C)) is drawn i.i.d. (independent and identically distributed) according to the allele frequencies for its letters at locus i. An example should suffice to illustrate this. Calculate P((2,1)|n₇=3) under the assumption that the allele frequency for A is 60%, and the minor allele frequency for C is 40%. (As an aside, note that P((2,1)|n₇=2)=0, since in this case the pair must sum to 2.) This probability is given by

${P\left( {\left. \left( {2,1} \right) \middle| n_{7} \right. = 3} \right)} = {\begin{pmatrix}3 \\2\end{pmatrix}({.60})^{2}({.40})}$

The general equation is

${P\left( {n^{A},\left. n^{C} \middle| n_{j} \right.,i} \right)} = {\begin{pmatrix}n \\n^{A}\end{pmatrix}\left( {1 - p_{ij}} \right)^{n^{A}}\left( p_{ij} \right)^{n^{C}}}$

Where p_(i,j) is the minor allele frequency at locus i of chromosome j.
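The allele distribution model above can be sketched as a small function. This is an illustration under the stated i.i.d. assumption; the function and parameter names are hypothetical and do not come from the disclosure.

```python
from math import comb


def allele_pattern_prob(n_a, n_c, copy_number, minor_freq):
    """P(n^A, n^C | n_j, i) under i.i.d. draws from the allele frequencies.

    minor_freq is the minor allele (C) frequency p_ij at the locus; the
    major allele (A) frequency is taken to be 1 - minor_freq.
    """
    # A pattern that does not sum to the copy number is impossible.
    if n_a < 0 or n_c < 0 or n_a + n_c != copy_number:
        return 0.0
    return comb(copy_number, n_a) * (1 - minor_freq) ** n_a * minor_freq ** n_c


def allele_patterns(copy_number):
    """Enumerate the unordered patterns (n^A, n^C) for a copy number."""
    return [(n_a, copy_number - n_a) for n_a in range(copy_number + 1)]
```

Calling `allele_pattern_prob(2, 1, 3, 0.40)` reproduces the worked example, 3 × 0.60² × 0.40, and `allele_patterns(3)` lists the four trisomy patterns.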

Error Model

Equation (*) depends on being able to calculate values for P(t^(A)|Q,n^(A)) and P(t^(C)|Q,n^(C)). For this an error model is needed. One may use the following error model:

$P\left(t^{A} \mid Q,n^{A}\right) = \begin{cases} p_{d} & t^{A} = \bot \text{ and } n^{A} > 0 \\ \left(1 - p_{d}\right) f_{Z}\left(\frac{1}{\sigma}\left(t^{A} - \frac{1}{\beta}\log\frac{Q_{T}}{n^{A}Q}\right)\right) & t^{A} \neq \bot \text{ and } n^{A} > 0 \\ 1 - p_{a} & t^{A} = \bot \text{ and } n^{A} = 0 \\ 2p_{a} f_{Z}\left(\frac{1}{\sigma}\left(t^{A} - \bot\right)\right) & t^{A} \neq \bot \text{ and } n^{A} = 0 \end{cases}$

Each of the four cases mentioned above is described here. In the first case, no signal is received, even though there was A-material on the locus. That is a dropout, and its probability is therefore p_(d). In the second case, a signal is received, as expected since there was A-material on the locus. The probability of this is the probability that a dropout does not occur, multiplied by the pdf for the distribution on the ct value when there is no dropout. (Note that, to be rigorous, one should divide through by that portion of the probability mass on the Gaussian curve that lies below ⊥, but this is practically one, and will be ignored here.) In the third case, no signal was received and there was no signal to receive. This is the probability that no drop-in occurred, 1−p_(a). In the final case, a signal is received even though there was no A-material on the locus. This is the probability of a drop-in multiplied by the pdf for the distribution on the ct value when there is a drop-in. Note that the ‘2’ at the beginning of the equation occurs because the Gaussian distribution in the case of a drop-in is modeled as being centered at ⊥. Thus, only half of the probability mass lies below ⊥ in the case of a drop-in, and when the equation is normalized by dividing through by one-half, it is equivalent to multiplying by 2. The error model for P(t^(C)|Q,n^(C)) by symmetry is the same as for P(t^(A)|Q,n^(A)) above. It should be obvious to one skilled in the art how different error models can be applied to a range of different genotyping platforms, for example the ILLUMINA INFINIUM genotyping platform.
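The four cases can be sketched as a single function. This is a minimal illustration with several assumptions that are not in the text: a "no signal" reading is represented by `None`, the numeric ct value of ⊥ is passed separately as `bottom_ct`, the logarithm is taken base 2 (as implied by the TAQMAN equation Q₀2^(βt)=Q_(T)), and all names are hypothetical.

```python
import math

NO_SIGNAL = None  # assumed stand-in for a reading equal to "bottom"


def std_normal_pdf(x):
    """Standard normal pdf f_Z evaluated at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def ct_likelihood(t, n_allele, q, *, q_t, beta, sigma, p_d, p_a, bottom_ct):
    """P(t | Q, n_allele) for one channel, following the four-case model."""
    if n_allele > 0:
        if t is NO_SIGNAL:
            return p_d  # case 1: dropout
        # case 2: signal expected; ct predicted from n_allele * Q * 2^(beta*t) = Q_T
        expected = math.log2(q_t / (n_allele * q)) / beta
        return (1.0 - p_d) * std_normal_pdf((t - expected) / sigma)
    if t is NO_SIGNAL:
        return 1.0 - p_a  # case 3: no material and no drop-in
    # case 4: drop-in, Gaussian centered at bottom, renormalized by the factor 2
    return 2.0 * p_a * std_normal_pdf((t - bottom_ct) / sigma)
```

As in the written model, the Gaussian terms are f_Z of the standardized residual without an extra 1/σ factor; the factor is the same for every hypothesis with a signal on that channel, so it does not change which hypothesis scores highest there.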

Computational Considerations

In one embodiment of the invention, the MAP estimation mathematics can be carried out by brute-force as specified in the final MAP equation, except for the integration over Q. Since doubling Q only results in a difference in ct value of 1/β, the equations are sensitive to Q only on the log scale. Therefore to do the integration it should be sufficient to try a handful of Q-values at different powers of two and to assume a uniform distribution on these values. For example, one could start at Q=Q_(T)2^(−20β), which is the quantity of material that would result in a ct value of 20, and then halve it in succession twenty times, yielding a final Q value that would result in a ct value of 40.
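The grid of Q values described above might be built as follows. This is a sketch with hypothetical names; the ct arithmetic assumes the TAQMAN relation Q·2^(βt)=Q_(T), and the end point of ct 40 in the example corresponds to β=1, since each halving of Q shifts the ct value by 1/β.

```python
import math


def q_grid(q_t, beta, start_ct=20, halvings=20):
    """Return a list of (Q, weight) pairs forming a uniform pmf on Q.

    The first value is Q_T * 2^(-start_ct * beta); each later value is
    half of the previous one.
    """
    q_start = q_t * 2.0 ** (-start_ct * beta)
    values = [q_start / 2 ** k for k in range(halvings + 1)]
    weight = 1.0 / len(values)
    return [(q, weight) for q in values]


def ct_for(q, q_t, beta):
    """ct value at which an initial quantity q reaches the threshold q_t."""
    return math.log2(q_t / q) / beta
```

The integral over Q is then approximated by summing the integrand over the returned pairs, each term multiplied by its weight.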

What follows is a re-derivation of a derivation described elsewhere in this disclosure, with slightly different emphasis, for elucidating the programming of the math. Note that the variable D below is not really a variable. It is always a constant set to the value of the data set actually in question, so it does not introduce another array dimension when representing in MATLAB. However, the variables D_(j) do introduce an array dimension, due to the presence of the index j.

$\mspace{79mu} {{\hat{n}}_{7} = {\underset{n_{7} \in {\{{1,2,3}\}}}{argmax}{P\left( {n_{7},D} \right)}}}$$\mspace{79mu} {{P\left( {n_{7},D} \right)} = {\sum\limits_{Q}{P\left( {n_{7},Q,D} \right)}}}$     P(n₇, Q, D) = P(n₇)P(Q)P(D₇Q, n₇)P(D_(X)Q)$\mspace{79mu} {{P\left( {D_{j}Q} \right)} - {\sum\limits_{n_{j} \in {\{{1,2,3}\}}}{P\left( {D_{j},{n_{j}Q}} \right)}}}$     P(D_(j), n_(j)Q) = P(n_(j))P(D_(j)Q, n_(j))$\mspace{79mu} {{P\left( {{D_{j}Q},n_{j}} \right)} = {\prod\limits_{i}\; {P\left( {{D_{ij}Q},n_{j}} \right)}}}$$\mspace{79mu} {{P\left( {{D_{ij}Q},n_{j}} \right)} = {\sum\limits_{{n^{A} + n^{C}} = n_{j}}{P\left( {D_{ij},n^{A},{n^{C}Q},n_{j}} \right)}}}$$\mspace{79mu} {{P\begin{pmatrix}{D_{ij},n^{A},} \\{{n^{C}Q},n_{j}}\end{pmatrix}} = {{P\left( {n^{A},{n^{C}n_{j}},i} \right)}{P\left( {{t_{ij}^{A}Q},n^{A}} \right)}{P\left( {{t_{ij}^{C}Q},n^{C}} \right)}}}$$\mspace{79mu} {{P\left( {n^{A},{n^{C}n_{j}},i} \right)} = {\begin{pmatrix}n \\n^{A}\end{pmatrix}\left( {1 - p_{ij}} \right)^{n^{A}}\left( p_{ij} \right)^{n^{C}}}}$${P\begin{pmatrix}{{t_{ij}^{A}Q},} \\n^{A}\end{pmatrix}} = \begin{Bmatrix}p_{d} & {t^{A} = {\bot{{and}\mspace{14mu} n^{A}} > 0}} \\{\left( {1 - p_{d}} \right){{fz}\left( {\frac{1}{\sigma}\left( {t^{A} - {\frac{1}{\beta}\log \; \frac{Q_{T}}{n^{A}Q}}} \right)} \right)}} & {t^{A} \neq \bot{{and}\mspace{14mu} n^{A}} > 0} \\{1 - p_{a}} & {t^{A} = {{\bot{{and}\mspace{14mu} n^{A}}} = 0}} \\{2p_{z}{f_{Z}\left( {\frac{1}{\sigma}\left( {{t^{A} -}\bot} \right)} \right)}} & {{t^{A} \neq \bot{{and}\mspace{14mu} n^{A}}} = 0}\end{Bmatrix}$

E MAP Detection of Aneuploidy with Parental Info

In one embodiment of the invention, the disclosed method enables one to make aneuploidy calls on each chromosome of each blastomere, given multiple blastomeres with measurements at some loci on all chromosomes, where it is not known how many copies of each chromosome there are. In this embodiment, MAP estimation is used to classify the copy number N_(j) of chromosome j, where jϵ{1, 2 . . . 22, X, Y}, from among the choices {0, 1, 2, 3} given the measurements D, which include both genotyping information of the blastomeres and the parents. To be general, let jϵ{1, 2 . . . m} where m is the number of chromosomes of interest; m=24 implies that all chromosomes are of interest. Formally, this looks like:

${\hat{n}}_{j} = {\underset{n_{j} \in {\{{1,2,3}\}}}{argmax}\; P\; \left( {n_{j},D} \right)}$

Unfortunately, it is not possible to calculate this probability, because the probability depends on an unknown random variable Q that describes the amplification factor of MDA. If the distribution f on Q were known, then it would be possible to solve the following:

${\hat{n}}_{j} = {\underset{n_{j} \in {\{{1,2,3}\}}}{argmax}\; {\int{f(Q){P\left( {n_{j},\left. D \middle| Q \right.} \right)}{dQ}}}}$

In practice, a continuous distribution on Q is not known. However, identifying Q to within a power of two is sufficient, and in practice a probability mass function (pmf) on Q that is uniform on say {2¹, 2² . . . , 2⁴⁰} can be used. In the development that follows, the integral sign will be used as though a probability distribution function (pdf) on Q were known, even though in practice a uniform pmf on a handful of exponential values of Q will be substituted.

This discussion will use the following notation and definitions:

N_(α) is the copy number of autosomal chromosome α, where αϵ{1, 2 . . .22}. It is a random variable. n_(α) denotes a potential value for N_(α).

N_(X) is the copy number of chromosome X. n_(X) denotes a potential value for N_(X).

N_(j) is the copy number of chromosome-j, where for the purposes here jϵ{1, 2 . . . m}. n_(j) denotes a potential value for N_(j).

m is the number of chromosomes of interest; m=24 when all chromosomes are of interest.

H is the set of aneuploidy states. hϵH. For the purposes of this derivation, let H={paternal monosomy, maternal monosomy, disomy, t1 paternal trisomy, t2 paternal trisomy, t1 maternal trisomy, t2 maternal trisomy}. Paternal monosomy means the only existing chromosome came from the father; paternal trisomy means there is one additional chromosome coming from the father. Type 1 (t1) paternal trisomy is such that the two paternal chromosomes are sister chromosomes (exact copies of each other) except in case of crossover, when a section of the two chromosomes are exact copies. Type 2 (t2) paternal trisomy is such that the two paternal chromosomes are complementary chromosomes (independent chromosomes coming from two grandparents). The same definitions apply to the maternal monosomy and maternal trisomies.

D is the set of all measurements including measurements on embryo D_(E) and on parents D_(F),D_(M). In the case where these are TAQMAN measurements on all chromosomes, one can say: D={D₁, D₂ . . . D_(m)}, D_(E)={D_(E,1), D_(E,2) . . . D_(E,m)}, where D_(k)=(D_(E,k), D_(F,k), D_(M,k)) and D_(E,j)={t_(E,ij)^(A),t_(E,ij)^(C)} is the set of TAQMAN measurements on chromosome j.

t_(E,ij)^(A) is the ct value on channel-A of locus i of chromosome-j. Similarly, t_(E,ij)^(C) is the ct value on channel-C of locus i of chromosome-j. (A is just a logical name and denotes the major allele value at the locus, while C denotes the minor allele value at the locus.)

Q represents a unit-amount of genetic material after MDA of a single cell's genomic DNA such that, if the copy number of chromosome-j is n_(j), then the total amount of genetic material at any locus of chromosome-j is n_(j)Q. For example, under trisomy, if a locus were AAC, then the amount of A-material at this locus is 2Q, the amount of C-material at this locus is Q, and the total combined amount of genetic material at this locus is 3Q.

q is the number of numerical steps that will be considered for the value of Q.

N is the number of SNPs per chromosome that will be measured.

(n^(A),n^(C)) denotes an unordered allele pattern at a locus when the copy number for the associated chromosome is n. n^(A) is the number of times allele A appears on the locus and n^(C) is the number of times allele C appears on the locus. Each can take on values in 0, . . . , n, and it must be the case that n^(A)+n^(C)=n. For example, under trisomy, the set of allele patterns is {(0,3),(1,2),(2,1),(3,0)}. The allele pattern (2,1) for example corresponds to a locus value of A²C, i.e., that two chromosomes have allele value A and the third has an allele value of C at the locus. Under disomy, the set of allele patterns is {(0,2),(1,1),(2,0)}. Under monosomy, the set of allele patterns is {(0,1),(1,0)}.

Q_(T) is the (known) threshold value from the fundamental TAQMAN equation Q₀2^(βt)=Q_(T).

β is the (known) doubling-rate from the fundamental TAQMAN equation Q₀2^(βt)=Q_(T).

⊥ (pronounced “bottom”) is the ct value that is interpreted as meaning “no signal”.

f_(Z)(x) is the standard normal Gaussian pdf evaluated at x.

σ is the (known) standard deviation of the noise on TAQMAN ct values.

MAP Solution

In the solution below, the following assumptions are made:

N_(j)s are independent of one another.

Allele values on neighboring loci are independent.

The goal is to classify the copy number of a designated chromosome. For instance, the MAP solution for chromosome j is given by

$\begin{aligned}
{\hat{n}}_{j} &= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\,P\left(n_{j},D \mid Q\right)dQ \\
&= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\sum\limits_{n_{1} \in \{1,2,3\}}\ldots\sum\limits_{n_{j-1} \in \{1,2,3\}}\;\sum\limits_{n_{j+1} \in \{1,2,3\}}\ldots\sum\limits_{n_{m} \in \{1,2,3\}} P\left(n_{1},\ldots,n_{m},D \mid Q\right)dQ \\
&= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\sum\limits_{n_{1} \in \{1,2,3\}}\ldots\sum\limits_{n_{j-1} \in \{1,2,3\}}\;\sum\limits_{n_{j+1} \in \{1,2,3\}}\ldots\sum\limits_{n_{m} \in \{1,2,3\}}\;\prod\limits_{k=1}^{m} P\left(n_{k}\right)P\left(D_{k} \mid Q,n_{k}\right)dQ \\
&= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{j}\right)P\left(D_{j} \mid Q,n_{j}\right)\right)\left(\prod\limits_{k \neq j}\sum\limits_{n_{k} \in \{1,2,3\}} P\left(n_{k}\right)P\left(D_{k} \mid Q,n_{k}\right)\right)dQ \\
&= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{j}\right)\sum\limits_{h \in H} P\left(D_{j} \mid Q,n_{j},h\right)P\left(h \mid n_{j}\right)\right)\left(\prod\limits_{k \neq j}\sum\limits_{n_{k} \in \{1,2,3\}} P\left(n_{k}\right)\sum\limits_{h \in H} P\left(D_{k} \mid Q,n_{k},h\right)P\left(h \mid n_{k}\right)\right)dQ \\
&= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{j}\right)\sum\limits_{h \in H} P\left(h \mid n_{j}\right)\prod\limits_{i} P\left(t_{E,ij}^{A},t_{E,ij}^{C},D_{F,ij},D_{M,ij} \mid Q,n_{j},h\right)\right) \\
&\qquad \times \left(\prod\limits_{k \neq j}\sum\limits_{n_{k} \in \{1,2,3\}} P\left(n_{k}\right)\sum\limits_{h \in H} P\left(h \mid n_{k}\right)\prod\limits_{i} P\left(t_{E,ik}^{A},t_{E,ik}^{C},D_{F,ik},D_{M,ik} \mid Q,n_{k},h\right)\right)dQ \\
&= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{j}\right)\sum\limits_{h \in H} P\left(h \mid n_{j}\right)\prod\limits_{i}\sum\limits_{\substack{n_{F}^{A} + n_{F}^{C} = 2 \\ n_{M}^{A} + n_{M}^{C} = 2}} P\left(n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)P\left(t_{E,ij}^{A},t_{E,ij}^{C},D_{F,ij},D_{M,ij} \mid Q,n_{j},h,n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)\right) \\
&\qquad \times \left(\prod\limits_{k \neq j}\sum\limits_{n_{k} \in \{1,2,3\}} P\left(n_{k}\right)\sum\limits_{h \in H} P\left(h \mid n_{k}\right)\prod\limits_{i}\sum\limits_{\substack{n_{F}^{A} + n_{F}^{C} = 2 \\ n_{M}^{A} + n_{M}^{C} = 2}} P\left(n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)P\left(t_{E,ik}^{A},t_{E,ik}^{C},D_{F,ik},D_{M,ik} \mid Q,n_{k},h,n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)\right)dQ \\
&= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\Bigg(P\left(n_{j}\right)\sum\limits_{h \in H} P\left(h \mid n_{j}\right)\prod\limits_{i}\sum\limits_{\substack{n_{F}^{A} + n_{F}^{C} = 2 \\ n_{M}^{A} + n_{M}^{C} = 2}} P\left(n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)P\left(t_{F,ij}^{A} \mid Q^{\prime},n_{F}^{A}\right)P\left(t_{F,ij}^{C} \mid Q^{\prime},n_{F}^{C}\right)P\left(t_{M,ij}^{A} \mid Q^{\prime},n_{M}^{A}\right)P\left(t_{M,ij}^{C} \mid Q^{\prime},n_{M}^{C}\right) \\
&\qquad\qquad \times \sum\limits_{n^{A} + n^{C} = n_{j}} P\left(n^{A},n^{C} \mid n_{j},h,n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)P\left(t_{E,ij}^{A} \mid Q,n^{A}\right)P\left(t_{E,ij}^{C} \mid Q,n^{C}\right)\Bigg) \\
&\qquad \times \Bigg(\prod\limits_{k \neq j}\sum\limits_{n_{k} \in \{1,2,3\}} P\left(n_{k}\right)\sum\limits_{h \in H} P\left(h \mid n_{k}\right)\prod\limits_{i}\sum\limits_{\substack{n_{F}^{A} + n_{F}^{C} = 2 \\ n_{M}^{A} + n_{M}^{C} = 2}} P\left(n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)P\left(t_{F,ik}^{A} \mid Q^{\prime},n_{F}^{A}\right)P\left(t_{F,ik}^{C} \mid Q^{\prime},n_{F}^{C}\right)P\left(t_{M,ik}^{A} \mid Q^{\prime},n_{M}^{A}\right)P\left(t_{M,ik}^{C} \mid Q^{\prime},n_{M}^{C}\right) \\
&\qquad\qquad \times \sum\limits_{n^{A} + n^{C} = n_{k}} P\left(n^{A},n^{C} \mid n_{k},h,n_{F}^{A},n_{F}^{C},n_{M}^{A},n_{M}^{C}\right)P\left(t_{E,ik}^{A} \mid Q,n^{A}\right)P\left(t_{E,ik}^{C} \mid Q,n^{C}\right)\Bigg)dQ \qquad (*)
\end{aligned}$

Here it is assumed that Q′, the value of Q for the parental data, is known exactly.

Copy Number Prior Probability

Equation (*) depends on being able to calculate values for P(n_(α)) and P(n_(X)), the distribution of prior probabilities of chromosome copy number, which is different depending on whether it is an autosomal chromosome or chromosome X. If these numbers are readily available for each chromosome, they may be used as is. If they are not available for all chromosomes, or are not reliable, some distributions may be assumed. Let the prior probability P(n_(α)=1)=P(n_(α)=2)=P(n_(α)=3)=⅓ for autosomal chromosomes, and let the probability of sex chromosomes being XY or XX be ½. P(n_(X)=0)=⅓×¼=1/12. P(n_(X)=1)=⅓×¾+⅓×½+⅓×½×¼=11/24=0.458, where ¾ is the probability of the monosomic chromosome being X (as opposed to Y), ½ is the probability of being XX for two chromosomes and ¼ is the probability of the third chromosome being Y. P(n_(X)=3)=⅓×½×¾=⅛=0.125, where ½ is the probability of being XX for two chromosomes and ¾ is the probability of the third chromosome being X. P(n_(X)=2)=1−P(n_(X)=0)−P(n_(X)=1)−P(n_(X)=3)=4/12=0.333.
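The arithmetic for the assumed chromosome X prior can be checked with exact fractions. This sketch simply reproduces the terms as written above; the function name is hypothetical.

```python
from fractions import Fraction


def x_copy_number_prior():
    """Assumed prior P(n_X = 0..3), using the fractions given in the text."""
    third, half = Fraction(1, 3), Fraction(1, 2)
    quarter, three_quarters = Fraction(1, 4), Fraction(3, 4)
    prior = {
        0: third * quarter,
        1: third * three_quarters + third * half + third * half * quarter,
        3: third * half * three_quarters,
    }
    # The remaining probability mass is assigned to two copies.
    prior[2] = 1 - sum(prior.values())
    return prior
```

The returned values are 1/12, 11/24, 1/3 and 1/8 for zero, one, two and three copies respectively, matching the decimal values 0.458, 0.333 and 0.125 quoted above.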

Aneuploidy State Prior Probability

Equation (*) depends on being able to calculate values for P(h|n_(j)), and these are shown in Table 6. The symbols used in Table 6 are explained below.

Symbol    Meaning
Ppm       paternal monosomy probability
Pmm       maternal monosomy probability
Ppt       paternal trisomy probability given trisomy
Pmt       maternal trisomy probability given trisomy
pt1       probability of type 1 trisomy for paternal trisomy, or P(type 1|paternal trisomy)
pt2       probability of type 2 trisomy for paternal trisomy, or P(type 2|paternal trisomy)
mt1       probability of type 1 trisomy for maternal trisomy, or P(type 1|maternal trisomy)
mt2       probability of type 2 trisomy for maternal trisomy, or P(type 2|maternal trisomy)

Note that there are many other ways that one skilled in the art, after reading this disclosure, could assign or estimate appropriate prior probabilities without changing the essential concept of the patent.

Allele Distribution Model without Parents

Equation (*) depends on being able to calculate values for P(n^(A),n^(C)|n_(α),i) and P(n^(A),n^(C)|n_(X),i). These values may be calculated by assuming that the allele pattern (n^(A),n^(C)) is drawn i.i.d. according to the allele frequencies for its letters at locus i. An illustrative example is given here. Calculate P((2,1)|n₇=3) under the assumption that the allele frequency for A is 60%, and the minor allele frequency for C is 40%. (As an aside, note that P((2,1)|n₇=2)=0, since in this case the pair must sum to 2.) This probability is given by

${P\left( {\left. \left( {2,1} \right) \middle| n_{7} \right. = 3} \right)} = {\begin{pmatrix}3 \\2\end{pmatrix}({.60})^{2}({.40})}$

The general equation is

${P\left( {n^{A},\left. n^{C} \middle| n_{j} \right.,i} \right)} = {\begin{pmatrix}n \\n^{A}\end{pmatrix}\left( {1 - p_{ij}} \right)^{n^{A}}\left( p_{ij} \right)^{n^{C}}}$

Where p_(ij) is the minor allele frequency at locus i of chromosome j.

Allele Distribution Model Incorporating Parental Genotypes

Equation (*) depends on being able to calculate values for P(n^(A),n^(C)|n_(j),h,T_(F,ij),T_(M,ij)), which are listed in Table 7. In a real situation, LDO will be known in either one of the parents, and the table would need to be augmented. If LDO are known in both parents, one can use the model described in the Allele Distribution Model without Parents section.

Population Frequency for Parental Truth

Equation (*) depends on being able to calculate P(T_(F,ij),T_(M,ij)). The probabilities of the combinations of parental genotypes can be calculated based on the population frequencies. For example, P(AA,AA)=P(A)⁴, and P(AC,AC)=P_(heteroz)², where P_(heteroz)=2P(A)P(C) is the probability of a diploid sample being heterozygous at one locus i.
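The two examples above generalize to any pair of parental genotypes. The sketch below assumes, as the examples do, that each parental allele is an independent draw from the population allele frequencies; the function names and the two-letter genotype strings are hypothetical.

```python
def genotype_prob(genotype, p_a):
    """Probability of a diploid genotype such as 'AA', 'AC' or 'CC'.

    p_a is the population frequency of allele A; allele C has frequency
    1 - p_a. The two alleles are assumed to be independent draws.
    """
    if len(genotype) != 2 or any(allele not in "AC" for allele in genotype):
        raise ValueError("genotype must be two letters drawn from A and C")
    p_c = 1.0 - p_a
    n_a = genotype.count("A")
    # Heterozygotes can arise in two orders, hence the factor 2.
    return {2: p_a * p_a, 1: 2.0 * p_a * p_c, 0: p_c * p_c}[n_a]


def parental_pair_prob(father, mother, p_a):
    """P(T_F, T_M) for one locus, treating the parents as independent."""
    return genotype_prob(father, p_a) * genotype_prob(mother, p_a)
```

With P(A)=0.6, `parental_pair_prob("AA", "AA", 0.6)` equals 0.6⁴ and `parental_pair_prob("AC", "AC", 0.6)` equals (2×0.6×0.4)².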

Error Model

Equation (*) depends on being able to calculate values for P(t^(A)|Q,n^(A)) and P(t^(C)|Q,n^(C)). For this an error model is needed. One may use the following error model:

$P\left(t^{A} \mid Q,n^{A}\right) = \begin{cases} p_{d} & t^{A} = \bot \text{ and } n^{A} > 0 \\ \left(1 - p_{d}\right) f_{Z}\left(\frac{1}{\sigma}\left(t^{A} - \frac{1}{\beta}\log\frac{Q_{T}}{n^{A}Q}\right)\right) & t^{A} \neq \bot \text{ and } n^{A} > 0 \\ 1 - p_{a} & t^{A} = \bot \text{ and } n^{A} = 0 \\ 2p_{a} f_{Z}\left(\frac{1}{\sigma}\left(t^{A} - \bot\right)\right) & t^{A} \neq \bot \text{ and } n^{A} = 0 \end{cases}$

This error model is used elsewhere in this disclosure, and the four cases mentioned above are described there. The computational considerations of carrying out the MAP estimation mathematics by brute force are also described in the same section.

Computational Complexity Estimation

Rewrite the equation (*) as follows,

$\begin{aligned}
{\hat{n}}_{j} &= \underset{n_{j} \in \{1,2,3\}}{argmax}\int f(Q)\left(P\left(n_{j}\right)\prod\limits_{i}\sum\limits_{n^{A} + n^{C} = n_{j}} P\left(n^{A},n^{C} \mid n_{j},i\right)P\left(t_{i,j}^{A} \mid Q,n^{A}\right)P\left(t_{i,j}^{C} \mid Q,n^{C}\right)\right) \\
&\qquad \times \left(\prod\limits_{k \neq j}\sum\limits_{n_{k} \in \{1,2,3\}} P\left(n_{k}\right)\prod\limits_{i}\sum\limits_{n^{A} + n^{C} = n_{k}} P\left(n^{A},n^{C} \mid n_{k},i\right)P\left(t_{i,k}^{A} \mid Q,n^{A}\right)P\left(t_{i,k}^{C} \mid Q,n^{C}\right)\right)dQ \qquad (*)
\end{aligned}$

Let the computation time for P(n^(A),n^(C)|n_(j),i) be t_(x), and that for P(t_(i,j)^(A)|Q,n^(A)) or P(t_(i,j)^(C)|Q,n^(C)) be t_(y). Note that P(n^(A),n^(C)|n_(j),i) may be pre-computed, since its values do not vary from experiment to experiment. For the discussion here, call a complete 23-chromosome aneuploidy screen an “experiment”. Computation of Π_(i)Σ_(n^(A)+n^(C)=n_(j))P(n^(A),n^(C)|n_(j),i)P(t_(i,j)^(A)|Q,n^(A))P(t_(i,j)^(C)|Q,n^(C)) for 23 chromosomes takes

if n_(j)=1, (2+t_(x)+2*t_(y))*2N*m

if n_(j)=2, (2+t_(x)+2*t_(y))*3N*m

if n_(j)=3, (2+t_(x)+2*t_(y))*4N*m

The unit of time here is the time for a multiplication or an addition. In total, it takes (2+t_(x)+2*t_(y))*9N*m.

Once these building blocks are computed, the overall integral may be calculated, which takes time on the order of (2+t_(x)+2*t_(y))*9N*m*q. In the end, it takes 2*m comparisons to determine the best estimate for n_(j). Therefore, overall the computational complexity is O(N*m*q).
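The operation count above can be tallied with a small helper. This is only a bookkeeping sketch with hypothetical names; the default unit costs of 1 for t_(x) and t_(y) are placeholders, not measured values.

```python
def map_operation_count(n_snps, n_chromosomes, q_steps, t_x=1, t_y=1):
    """Approximate operation count for the brute-force MAP computation.

    Uses (2 + t_x + 2*t_y) operations per allele pattern, 2 + 3 + 4 = 9
    patterns per SNP across copy numbers 1, 2 and 3, repeated for each of
    the q values of Q, plus 2*m final comparisons.
    """
    per_pattern = 2 + t_x + 2 * t_y
    per_q_value = per_pattern * (2 + 3 + 4) * n_snps * n_chromosomes
    return per_q_value * q_steps + 2 * n_chromosomes
```

For example, `map_operation_count(100, 24, 21)` gives 2,268,048 operations, and doubling any one of N, m or q roughly doubles the count, consistent with O(N*m*q).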

What follows is a re-derivation of the original derivation, with a slight difference in emphasis in order to elucidate the programming of the math. Note that the variable D below is not really a variable. It is always a constant set to the value of the data set actually in question, so it does not introduce another array dimension when representing in MATLAB. However, the variables D_(j) do introduce an array dimension, due to the presence of the index j.

$\mspace{79mu} {{\hat{n}}_{j} = {\underset{n_{j} \in {\{{1,2,3}\}}}{argmax}{P\left( {n_{j},D} \right)}}}$$\mspace{79mu} {{P\left( {n_{j},D} \right)} = {\sum\limits_{Q}{P\left( {n_{j},Q,D} \right)}}}$     P(n_(j), Q, D) = P(n_(j))P(Q)P(D_(j)Q, n_(j))P(D_(k ≠ j)Q)$\mspace{79mu} {{{P\left( {D_{j}Q} \right)} - {\sum\limits_{n_{j} \in {\{{1,2,3}\}}}{{P\left( {D_{j},{n_{j}Q}} \right)}\mspace{79mu} {P\left( {D_{j},{n_{j}Q}} \right)}}}} = {{{P\left( n_{j} \right)}{P\left( {{D_{j}Q},n_{j}} \right)}\mspace{79mu} {P\left( {{D_{j}Q},n_{j}} \right)}} = {{\prod\limits_{i}\; {{P\left( {{D_{ij}Q},n_{j}} \right)}\mspace{79mu} {P\left( {{D_{ij}Q},n_{j}} \right)}}} = {{\sum\limits_{{n^{A} + n^{C}} = n_{j}}{{P\left( {D_{ij},n^{A},{n^{C}Q},n_{j}} \right)}\mspace{79mu} {P\begin{pmatrix}{D_{ij},n^{A},} \\{{n^{C}Q},n_{j}}\end{pmatrix}}}} = {{P\left( {n^{A},{n^{C}n_{j}},i} \right)}{P\left( {{t_{ij}^{A}Q},n^{A}} \right)}{P\left( {{t_{ij}^{C}Q},n^{C}} \right)}}}}}}$$\mspace{79mu} {{P\left( {n^{A},{n^{C}n_{j}},i} \right)} = {{\begin{pmatrix}n \\n^{A}\end{pmatrix}\left( {1 - p_{ij}} \right)^{n^{A}}\left( p_{ij} \right)^{n^{C}}{P\begin{pmatrix}{{t_{ij}^{A}Q},} \\n^{A}\end{pmatrix}}} = \begin{Bmatrix}p_{d} & {t^{A} = {\bot{{and}\mspace{14mu} n^{A}} > 0}} \\{\left( {1 - p_{d}} \right){{fz}\left( {\frac{1}{\sigma}\left( {t^{A} - {\frac{1}{\beta}\log \; \frac{Q_{T}}{n^{A}Q}}} \right)} \right)}} & {t^{A} \neq \bot{{and}\mspace{14mu} n^{A}} > 0} \\{1 - p_{a}} & {t^{A} = {{\bot{{and}\mspace{14mu} n^{A}}} = 0}} \\{2p_{z}{f_{z}\left( {\frac{1}{\sigma}\left( {{t^{A} -}\bot} \right)} \right)}} & {{t^{A} \neq \bot{{and}\mspace{14mu} n^{A}}} = 0}\end{Bmatrix}}}$

F Qualitative Chromosome Copy Number Calling

One way to determine the number of copies of a chromosome segment in the genome of a target individual by incorporating the genetic data of the target individual and related individual(s) into a hypothesis, and calculating the most likely hypothesis is described here. In one embodiment of the invention, the aneuploidy calling method may be modified to use purely qualitative data. There are many approaches to solving this problem, and several of them are presented here. It should be obvious to one skilled in the art how to use other methods to accomplish the same end, and these will not change the essence of the disclosure.

Notation for Qualitative CNC

1. N is the total number of SNPs on the chromosome.

2. n is the chromosome copy number.

3. n^(M) is the number of copies supplied to the embryo by the mother: 0, 1, or 2.

4. n^(F) is the number of copies supplied to the embryo by the father: 0, 1, or 2.

5. p_(d) is the dropout rate, and f(p_(d)) is a prior on this rate.

6. p_(a) is the dropin rate, and f(p_(a)) is a prior on this rate.

7. c is the cutoff threshold for no-calls.

8. D=(x_(k),y_(k)) is the platform response on channels X and Y for SNP k.

9. D(c)={G(x_(k),y_(k);c)}={ĝ_(k) ^((c))} is the set of genotype calls on the chromosome. Note that the genotype calls depend on the no-call cutoff threshold c.

10. ĝ_(k) ^((c)) is the genotype call on the k-th SNP (as opposed to the true value): one of AA, AB, BB, or NC (no-call).

11. Given a genotype call ĝ at SNP k, the variables (ĝ_(X), ĝ_(Y)) are indicator variables (1 or 0), indicating whether the genotype ĝ implies that channel X or Y has “lit up”. Formally, ĝ_(X)=1 just in case ĝ contains the allele A, and ĝ_(Y)=1 just in case ĝ contains the allele B.

12. M={g_(k) ^(M)} is the known true sequence of genotype calls on the mother. g^(M) refers to the genotype value at some particular locus.

13. F={g_(k) ^(F)} is the known true sequence of genotype calls on the father. g^(F) refers to the genotype value at some particular locus.

14. n^(A),n^(B) are the true number of copies of A and B on the embryo (implicitly at locus k), respectively. Values must be in {0,1,2,3,4}.

15. c_(M) ^(A),c_(M) ^(B) are the number of A alleles and B alleles respectively supplied by the mother to the embryo (implicitly at locus k). The values must be in {0, 1, 2}, and must not sum to more than 2. Similarly, c_(F) ^(A),c_(F) ^(B) are the number of A alleles and B alleles respectively supplied by the father to the embryo (implicitly at locus k). Altogether, these four values exactly determine the true genotype of the embryo. For example, if the values were (1,0) and (1,1), then the embryo would have type AAB.

Solution 1: Integrate Over Dropout and Dropin Rates.

In the embodiment of the invention described here, the solution applies to just a single chromosome. In reality, there is loose coupling among all chromosomes to help decide on dropout rate p_(d), but the math is presented here for just a single chromosome. It should be obvious to one skilled in the art how one could perform this integral over fewer, more, or different parameters that vary from one experiment to another. It should also be obvious to one skilled in the art how to apply this method to handle multiple chromosomes at a time, while integrating over ADO and ADI. Further details are given in Solution 3B below.

$\mspace{79mu} {{P\left( {{n{D(c)}},M,F} \right)} = {\sum\limits_{{({n^{M},n^{F}})} \in n}^{\;}{P\left( {n^{M},{n^{F}{D(c)}},M,F} \right)}}}$$\mspace{79mu} {{P\begin{pmatrix}{n^{M},{n^{F}{D(c)}},} \\{M,F}\end{pmatrix}} = \frac{\begin{matrix}{{P\left( n^{M} \right)}{P\left( n^{F} \right)}} \\{P\left( {{{D(c)}n^{M}},n^{F},M,F} \right)}\end{matrix}}{\begin{matrix}{\sum\limits_{({n^{M},n^{F}})}{{P\left( n^{M} \right)}{P\left( n^{F} \right)}}} \\{P\left( {{{D(c)}n^{M}},n^{F},M,F} \right)}\end{matrix}}}$ $\mspace{79mu} {{P\begin{pmatrix}{{{D(c)}n^{M}},} \\{n^{F},M,F}\end{pmatrix}} = {\int{\int{{f\left( p_{d} \right)}{f\left( p_{a} \right)}{P\begin{pmatrix}{{{D(c)}n^{M}},n^{F},} \\{M,F,p_{d},p_{a}}\end{pmatrix}}{dp}_{d}{dp}_{a}}}}}$ ${P\begin{pmatrix}{{{D(c)}n^{M}},n^{F},} \\{M,F,p_{d},p_{a}}\end{pmatrix}} = {{\prod\limits_{k}\; {P\left( {{{G\left( {x_{k},{y_{k};c}} \right)}n^{M}},n^{F},g_{k}^{M},g_{k}^{F},p_{d},p_{a}} \right)}} = {{\prod\limits_{\underset{\underset{\hat{g} \in {\{{{AA},{AB},{NC}}\}}}{g^{F} \in {\{{{AA},{AB},{BB}}\}}}}{g^{M} \in {\{{{AA},{AB},{BB}}\}}}}\; {\prod\limits_{\{{{{k:\; g_{k}^{M}} = g^{M}},{g_{k}^{F} = g^{F}},{{\overset{\sim}{g}}_{k}^{(c)} = \overset{\sim}{g}}}\}}\; {P\left( {{\hat{g}n^{M}},n^{F},g^{M},g^{F},p_{d},p_{a}} \right)}}} = {\prod\limits_{\underset{\underset{\hat{g} \in {\{{{AA},{AB},{BB},{NC}}\}}}{g^{F} \in {\{{{AA},{AB},{BB}}\}}}}{g^{M} \in {\{{{AA},{AB},{BB}}\}}}}{P\begin{pmatrix}{{\hat{g}n^{M}},n^{F},g^{M},} \\{g^{F},p_{d},p_{a}}\end{pmatrix}}^{{\{\begin{matrix}{{{k:\; g_{k}^{M}} = g^{M}},} \\{{g_{k}^{F} = g^{F}},{{\text{?}{\hat{g}}_{k}^{(c)}} = {\hat{g}\text{?}}}}\end{matrix}\}}}}}}$ $\mspace{79mu} {\exp \begin{pmatrix}{\sum\limits_{\underset{\underset{\hat{g} \in {\{{{AA},{AB},{BB},{NC}}\}}}{g^{F} \in {\{{{AA},{AB},{BB}}\}}}}{g^{M} \in {\{{{AA},{AB},{BB}}\}}}}^{\;}{{\begin{Bmatrix}{{{k:\; g_{k}^{M}} = g^{M}},} \\{{g_{k}^{F} = g^{F}},{{\text{?}{\hat{g}}_{k}^{(c)}} 
= {\hat{g}\text{?}}}}\end{Bmatrix}} \times}} \\{\log \; {P\left( {{\hat{g}n^{M}},n^{F},g^{M},g^{F},p_{d},p_{a}} \right)}}\end{pmatrix}}$ $\mspace{79mu} {{P\begin{pmatrix}{{\hat{g}n^{M}},n^{F},g^{M},} \\{g^{F},p_{d},p_{a}}\end{pmatrix}} = {\sum\limits_{n^{A},n^{B}}{\underset{\underset{geneticmodeling}{}}{P\begin{pmatrix}{n^{A},{n^{B}n^{M}},} \\{n^{F},g^{M},g^{F},}\end{pmatrix}}\overset{\overset{{platform}\text{?}{modeling}}{\_}}{\begin{pmatrix}{P\left( {{{\hat{g}}_{X}n^{A}},p_{d},p_{a}} \right)} \\{P\left( {{{\hat{g}}_{Y}n^{B}},p_{d},p_{a}} \right)}\end{pmatrix}}}}}$ $\mspace{79mu} {{P\begin{pmatrix}{{{\hat{g}}_{X}n^{A}},} \\{p_{d},p_{a}}\end{pmatrix}} = \left( {{{\hat{g}}_{X}\begin{pmatrix}{\left( {1 - p_{d}^{n^{A}}} \right) +} \\{\left( {n^{A} = 0} \right)p_{a}}\end{pmatrix}} + {\left( {1 - {\hat{g}}_{X}} \right)\begin{pmatrix}{{\left( {n^{A} > 0} \right)p_{d}^{n^{A}}} +} \\{\left( {n^{A} = 0} \right)\left( {1 - p_{a}} \right)}\end{pmatrix}}} \right)}$?indicates text missing or illegible when filed

The other derivation is the same, except applied to channel Y.
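The per-channel platform model can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `p_channel` and its argument names are hypothetical, and the parenthesized conditions such as (n^(A)=0) in the equation are read as indicators equal to 1 when true and 0 otherwise.

```python
def p_channel(lit: int, n_copies: int, p_d: float, p_a: float) -> float:
    """P(g_hat_X = lit | n^A = n_copies, p_d, p_a) for a single channel.

    lit      -- 1 if the channel "lit up" in the genotype call, else 0
    n_copies -- true number of copies of the allele read by this channel
    p_d, p_a -- dropout and dropin rates
    """
    if n_copies > 0:
        # The channel stays dark only if every copy drops out independently.
        p_lit = 1.0 - p_d ** n_copies
    else:
        # With no copies present, the channel lights up only through dropin.
        p_lit = p_a
    return p_lit if lit == 1 else 1.0 - p_lit
```

For example, with two true copies and p_(d)=0.5 the channel lights up with probability 0.75, and with zero true copies it lights up with probability p_(a).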

${P\begin{pmatrix}{n^{A},{n^{B}n^{M}},} \\{n^{F},g^{M},g^{F},}\end{pmatrix}} = {\sum\limits_{\underset{{c_{M}^{B} + c_{F}^{B}} = n^{B}}{{c_{M}^{A} + c_{F}^{A}} = n^{A}}}{{P\begin{pmatrix}{c_{M}^{A},{c_{M}^{B}}} \\{n^{M},g^{M}}\end{pmatrix}}{P\begin{pmatrix}{c_{F}^{A},{c_{F}^{B}}} \\{n^{F},g^{F}}\end{pmatrix}}}}$${P\left( {c_{M}^{A},{c_{M}^{B}n^{M}},g^{M}} \right)} = {\left( {{c_{M}^{A} + c_{M}^{B}} = n^{M}} \right)\left\{ \begin{matrix}{\left( {c_{M}^{B} = 0} \right),} & {g^{M} = {AA}} \\{\left( {c_{M}^{A} = 0} \right),} & {g^{M} = {BB}} \\{\frac{1}{n^{M} + 1},} & {g^{M} = {AB}}\end{matrix} \right.}$

The other derivation is the same, except applied to the father.
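To make the combination of genetic and platform modeling concrete, the following Python sketch evaluates P(ĝ|n^(M),n^(F),g^(M),g^(F),p_(d),p_(a)) for one SNP by enumerating the parental allele contributions. All function and variable names are hypothetical. The mapping of NC to both channels dark follows from the indicator definition in the notation but is an assumption of this sketch, and the cutoff threshold c is not modeled.

```python
# Channel indicators (g_hat_X, g_hat_Y) for each genotype call.
# Treating NC as "neither channel lit" is an assumption of this sketch.
CHANNELS = {"AA": (1, 0), "AB": (1, 1), "BB": (0, 1), "NC": (0, 0)}

def p_channel(lit, n_copies, p_d, p_a):
    """Platform model for one channel (dropout if copies exist, else dropin)."""
    p_lit = 1.0 - p_d ** n_copies if n_copies > 0 else p_a
    return p_lit if lit == 1 else 1.0 - p_lit

def p_parent(c_a, c_b, n_parent, g_parent):
    """P(c^A, c^B | n_parent, g_parent): alleles supplied by one parent."""
    if c_a + c_b != n_parent:
        return 0.0
    if g_parent == "AA":
        return 1.0 if c_b == 0 else 0.0
    if g_parent == "BB":
        return 1.0 if c_a == 0 else 0.0
    return 1.0 / (n_parent + 1)  # heterozygous parent

def p_call(g_hat, n_m, n_f, g_m, g_f, p_d, p_a):
    """P(g_hat | n^M, n^F, g^M, g^F, p_d, p_a) at a single SNP."""
    x, y = CHANNELS[g_hat]
    total = 0.0
    for c_ma in range(n_m + 1):          # A alleles from the mother
        c_mb = n_m - c_ma
        for c_fa in range(n_f + 1):      # A alleles from the father
            c_fb = n_f - c_fa
            weight = p_parent(c_ma, c_mb, n_m, g_m) * p_parent(c_fa, c_fb, n_f, g_f)
            if weight == 0.0:
                continue
            n_a, n_b = c_ma + c_fa, c_mb + c_fb
            total += weight * p_channel(x, n_a, p_d, p_a) * p_channel(y, n_b, p_d, p_a)
    return total
```

Enumerating the four parental counts directly is equivalent to the sum over (n^(A),n^(B)), because the platform terms depend on the parental contributions only through those two totals.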

Solution 2: Use ML to Estimate Optimal Cutoff Threshold c

Solution 2, Variation A

$\hat{c} = \underset{c \in (0,a]}{\operatorname{argmax}}\; P(D(c) \mid M, F)$

$P(n) = \sum_{(n^{M},n^{F}) \in n} P(n^{M}, n^{F} \mid D(\hat{c}), M, F)$

In this embodiment, one first uses ML estimation to get the best estimate of the cutoff threshold based on the data, and then uses this ĉ to do the standard Bayesian inference as in Solution 1. Note that, as written, the estimate of c would still involve integrating over all dropout and dropin rates. However, since it is known that the dropout and dropin parameters tend to peak sharply in probability when they are “tuned” to their proper values with respect to c, one may save computation time by doing the following instead:

Solution 2, Variation B

$\hat{c}, \hat{p}_{d}, \hat{p}_{a} = \underset{c,\, p_{d},\, p_{a}}{\operatorname{argmax}}\; f(p_{d})\, f(p_{a})\, P(D(c) \mid M, F, p_{d}, p_{a})$

$P(n) = \sum_{(n^{M},n^{F}) \in n} P(n^{M}, n^{F} \mid D(\hat{c}), M, F, \hat{p}_{d}, \hat{p}_{a})$

In this embodiment, it is not necessary to integrate a second time over the dropout and dropin parameters. The equation goes over all possible triples in the first line. In the second line, it just uses the optimal triple to perform the inference calculation.

Solution 3: Combining Data Across Chromosomes

The data across different chromosomes is conditionally independent given the cutoff and dropout/dropin parameters, so one reason to process them together is to get better resolution on the cutoff and dropout/dropin parameters, assuming that these are actually constant across all chromosomes (and there is good scientific reason to believe that they are roughly constant). In one embodiment of the invention, given this observation, it is possible to use a simple modification of the methods in Solution 2 above. Rather than independently estimating the cutoff and dropout/dropin parameters on each chromosome, it is possible to estimate them once using all the chromosomes.

Notation

Since data from all chromosomes is being combined, use the subscript j to denote the j-th chromosome. For example, D_(j)(c) is the genotype data on chromosome j using c as the no-call threshold. Similarly, M_(j),F_(j) are the genotype data on the parents on chromosome j.

Solution 3, Variation A: Use All Data to Estimate Cutoff and Dropout/Dropin

$\hat{c}, \hat{p}_{d}, \hat{p}_{a} = \underset{c,\, p_{d},\, p_{a}}{\operatorname{argmax}}\; f(p_{d})\, f(p_{a}) \prod_{j} P(D_{j}(c) \mid M_{j}, F_{j}, p_{d}, p_{a})$

$P(n_{j}) = \sum_{(n^{M},n^{F}) \in n_{j}} P(n^{M}, n^{F} \mid D_{j}(\hat{c}), M_{j}, F_{j}, \hat{p}_{d}, \hat{p}_{a})$

Solution 3, Variation B:

Theoretically, this is the optimal estimate for the copy number on chromosome j.

$\hat{n}_{j} = \underset{n}{\operatorname{argmax}} \sum_{(n^{M},n^{F}) \in n} \int\int f(p_{d})\, f(p_{a})\, P(D_{j}(\hat{c}) \mid n^{M}, n^{F}, M_{j}, F_{j}, p_{d}, p_{a}) \prod_{i \neq j} P(D_{i}(\hat{c}) \mid n^{M}, n^{F}, M_{i}, F_{i}, p_{d}, p_{a})\, dp_{d}\, dp_{a}$

Estimating Dropout/Dropin Rates from Known Samples

For the sake of thoroughness, a brief discussion of dropout and dropin rates is given here. Since dropout and dropin rates are so important for the algorithm, it may be beneficial to analyze data with a known truth model to find out what the true dropout/dropin rates are. Note that there is no single true dropout rate: it is a function of the cutoff threshold. That said, if highly reliable genomic data exists that can be used as a truth model, then it is possible to plot the dropout/dropin rates of MDA experiments as a function of the cutoff threshold. Here a maximum likelihood estimation is used.

$\hat{c}, \hat{p}_{d}, \hat{p}_{a} = \underset{c,\, p_{d},\, p_{a}}{\operatorname{argmax}} \prod_{j,k} P(\hat{g}_{jk}^{(c)} \mid g_{jk}, p_{d}, p_{a})$

In the above equation, ĝ_(jk) ^((c)) is the genotype call on SNP k of chromosome j, using c as the cutoff threshold, while g_(jk) is the true genotype as determined from a genomic sample. The above equation returns the most likely triple of cutoff, dropout, and dropin. It should be obvious to one skilled in the art how one can implement this technique without parent information using prior probabilities associated with the genotypes of each of the SNPs of the target cell that will not undermine the validity of the work, and this will not change the essence of the invention.
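As a minimal sketch of this maximum likelihood estimate, the Python below fits p_(d) and p_(a) by grid search from (call, truth) pairs at a single fixed cutoff. Maximizing over the cutoff c as well would require re-calling genotypes from the raw platform data and is omitted. The names are hypothetical, truth genotypes are assumed to be strings of A and B alleles, and an NC call is assumed to mean that neither channel lit up.

```python
import math

# Channel indicators per call; NC as "neither channel lit" is an assumption.
CHANNELS = {"AA": (1, 0), "AB": (1, 1), "BB": (0, 1), "NC": (0, 0)}

def p_channel(lit, n_copies, p_d, p_a):
    """Platform model for one channel (dropout if copies exist, else dropin)."""
    p_lit = 1.0 - p_d ** n_copies if n_copies > 0 else p_a
    return p_lit if lit == 1 else 1.0 - p_lit

def log_likelihood(pairs, p_d, p_a):
    """Sum of log P(call | truth, p_d, p_a) over (call, truth) pairs."""
    total = 0.0
    for call, truth in pairs:
        x, y = CHANNELS[call]
        n_a, n_b = truth.count("A"), truth.count("B")
        total += math.log(p_channel(x, n_a, p_d, p_a) * p_channel(y, n_b, p_d, p_a))
    return total

def fit_rates(pairs, grid):
    """Return the (p_d, p_a) grid point with the highest log-likelihood.

    Grid values must lie strictly between 0 and 1 so every log is finite.
    """
    candidates = ((p_d, p_a) for p_d in grid for p_a in grid)
    return max(candidates, key=lambda rates: log_likelihood(pairs, *rates))
```

A grid search is used here only for transparency; any numerical optimizer over the same log-likelihood would serve the same purpose.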

G Bayesian Plus Sperm Method

Another way to determine the number of copies of a chromosome segment in the genome of a target individual is described here. In one embodiment of the invention, the genetic data of a sperm from the father and crossover maps can be used to enhance the methods described herein. Throughout this description, it is assumed that there is a chromosome of interest, and all notation is with respect to that chromosome. It is also assumed that there is a fixed cutoff threshold for genotyping. Previous comments about the impact of cutoff threshold choice apply, but will not be made explicit here. In order to best phase the embryonic information, one should combine data from all blastomeres on multiple embryos simultaneously. Here, for ease of explication, it is assumed that there is just one embryo with no additional blastomeres. However, the techniques mentioned in various other sections regarding the use of multiple blastomeres for allele-calling translate in a straightforward manner here.

Notation

1. n is the chromosome copy number.

2. n^(M) is the number of copies supplied to the embryo by the mother: 0, 1, or 2.

3. n^(F) is the number of copies supplied to the embryo by the father: 0, 1, or 2.

4. p_(d) is the dropout rate, and f(p_(d)) is a prior on this rate.

5. p_(a) is the dropin rate, and f(p_(a)) is a prior on this rate.

6. D={ĝ_(k)} is the set of genotype measurements on the chromosome of the embryo. ĝ_(k) is the genotype call on the k-th SNP (as opposed to the true value): one of AA, AB, BB, or NC (no-call). Note that the embryo may be aneuploid, in which case the true genotype at a SNP may be, for example, AAB, or even AAAB, but the genotype measurements will always be one of the four listed. (Note: elsewhere in this disclosure ‘B’ has been used to indicate a heterozygous locus. That is not the sense in which it is being used here. Here ‘A’ and ‘B’ are used to denote the two possible allele values that could occur at a given SNP.)

7. M={g_(k) ^(M)} is the known true sequence of genotypes on the mother. g_(k) ^(M) is the genotype value at the k-th SNP.

8. F={g_(k) ^(F)} is the known true sequence of genotypes on the father. g_(k) ^(F) is the genotype value at the k-th SNP.

9. S={ĝ_(k) ^(S)} is the set of genotype measurements on a sperm from the father. ĝ_(k) ^(S) is the genotype call at the k-th SNP.

10. (m₁,m₂) is the true but unknown ordered pair of phased haplotype information on the mother. m_(1k) is the allele value at SNP k of the first haploid sequence. m_(2k) is the allele value at SNP k of the second haploid sequence. (m₁,m₂)ϵM is used to indicate the set of phased pairs (m₁,m₂) that are consistent with the known genotype M. Similarly, (m₁,m₂)ϵg_(k) ^(M) is used to indicate the set of phased pairs that are consistent with the known genotype of the mother at SNP k.

11. (f₁,f₂) is the true but unknown ordered pair of phased haplotype information on the father. f_(1k) is the allele value at SNP k of the first haploid sequence. f_(2k) is the allele value at SNP k of the second haploid sequence. (f₁,f₂)ϵF is used to indicate the set of phased pairs (f₁,f₂) that are consistent with the known genotype F. Similarly, (f₁,f₂)ϵg_(k) ^(F) is used to indicate the set of phased pairs that are consistent with the known genotype of the father at SNP k.

12. s₁ is the true but unknown phased haplotype information on the measured sperm from the father. s_(1k) is the allele value at SNP k of this haploid sequence. It can be guaranteed that this sperm is euploid by measuring several sperm and selecting one that is euploid.

13. χ^(M)={ϕ₁, . . . , ϕ_(n^M)} is the multiset of crossover maps that resulted in maternal contribution to the embryo on this chromosome. Similarly, χ^(F)={θ₁, . . . , θ_(n^F)} is the multiset of crossover maps that resulted in paternal contribution to the embryo on this chromosome. Here the possibility that the chromosome may be aneuploid is explicitly modeled. Each parent can contribute zero, one, or two copies of the chromosome to the embryo. If the chromosome is an autosome, then euploidy is the case in which each parent contributes exactly one copy, i.e., χ^(M)={ϕ₁} and χ^(F)={θ₁}. But euploidy is only one of the 3×3=9 possible cases. The remaining eight are all different kinds of aneuploidy. For example, in the case of maternal trisomy resulting from an M2 copy error, one would have χ^(M)={ϕ₁,ϕ₁} and χ^(F)={θ₁}. In the case of maternal trisomy resulting from an M1 copy error, one would have χ^(M)={ϕ₁,ϕ₂} and χ^(F)={θ₁}. (χ^(M),χ^(F))ϵn will be used to indicate the set of sub-hypothesis pairs (χ^(M),χ^(F)) that are consistent with the copy number n. χ_(k) ^(M) will be used to denote {ϕ_(1,k), . . . , ϕ_(n^M,k)}, the multiset of crossover map values restricted to the k-th SNP, and similarly for χ^(F). χ_(k) ^(M)(m₁,m₂) is used to mean the multiset of allele values {ϕ_(1,k)(m₁,m₂), . . . , ϕ_(n^M,k)(m₁,m₂)}={m_(ϕ₁,k), . . . , m_(ϕ_(n^M),k)}. Keep in mind that ϕ_(l,k)ϵ{1,2}.

14. ψ is the crossover map that resulted in the measured sperm from the father. Thus s₁=ψ(f₁,f₂). Note that it is not necessary to consider a crossover multiset because it is assumed that the measured sperm is euploid. ψ_(k) will be used to denote the value of this crossover map at the k-th SNP.

15. Keeping in mind the previous two definitions, let {e₁ ^(M), . . . , e_(n^M) ^(M)} be the multiset of true but unknown haploid sequences contributed to the embryo by the mother at this chromosome. Specifically, e_(l) ^(M)=ϕ_(l)(m₁,m₂), where ϕ_(l) is the l-th element of the multiset χ^(M), and e_(lk) ^(M) is the allele value at the k-th SNP. Similarly, let {e₁ ^(F), . . . , e_(n^F) ^(F)} be the multiset of true but unknown haploid sequences contributed to the embryo by the father at this chromosome. Then e_(l) ^(F)=θ_(l)(f₁,f₂), where θ_(l) is the l-th element of the multiset χ^(F), and e_(lk) ^(F) is the allele value at the k-th SNP. Also, {e₁ ^(M), . . . , e_(n^M) ^(M)}=χ^(M)(m₁,m₂), and {e₁ ^(F), . . . , e_(n^F) ^(F)}=χ^(F)(f₁,f₂) may be written.

16. P(ĝ_(k)|χ_(k) ^(M)(m₁,m₂), χ_(k) ^(F)(f₁,f₂),p_(d),p_(a)) denotes the probability of the genotype measurement on the embryo at SNP k given a hypothesized true underlying genotype on the embryo and given hypothesized underlying dropout and dropin rates. Note that χ_(k) ^(M)(m₁,m₂) and χ_(k) ^(F)(f₁,f₂) are both multisets, so are capable of expressing aneuploid genotypes. For example, χ_(k) ^(M)(m₁,m₂)={A,A} and χ_(k) ^(F)(f₁,f₂)={B} expresses the maternal trisomic genotype AAB.
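The crossover-map notation in items 13 through 15 can be illustrated with a short Python sketch. It assumes, for illustration only, that a crossover map is stored as a list whose k-th entry is 1 or 2, selecting the first or second parental haplotype at SNP k. The function names are hypothetical.

```python
def apply_crossover_map(phi, h1, h2):
    """Return the haploid sequence phi(h1, h2) selected by a crossover map.

    phi    -- list of 1s and 2s, one entry per SNP
    h1, h2 -- the two phased parental haplotypes (sequences of allele values)
    """
    if not (len(phi) == len(h1) == len(h2)):
        raise ValueError("crossover map and haplotypes must have equal length")
    if any(value not in (1, 2) for value in phi):
        raise ValueError("crossover map values must be 1 or 2")
    return [h1[k] if phi[k] == 1 else h2[k] for k in range(len(phi))]

def contributed_alleles_at_snp(chi, h1, h2, k):
    """Multiset of allele values contributed by one parent at SNP k.

    chi is a list of crossover maps (a multiset), one per contributed copy.
    The result is returned as a sorted list so that equal multisets compare equal.
    """
    return sorted(apply_crossover_map(phi, h1, h2)[k] for phi in chi)
```

Under this representation, an M2 copy error corresponds to passing the same map twice in `chi`, while an M1 copy error corresponds to two different maps.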

Note that in this method, the measurements on the mother and father are treated as known truth, while in other places in this disclosure they are treated simply as measurements. Since the measurements on the parents are very precise, treating them as though they are known truth is a reasonable approximation to reality. They are treated as known truth here in order to demonstrate how such an assumption is handled, although it should be clear to one skilled in the art how the more precise method, used elsewhere in the patent, could equally well be used.

Solution

$\hat{n} = \underset{n}{\operatorname{argmax}}\; P(n, D, M, F, S)$

$\begin{aligned} P(n, D, M, F, S) &= \sum_{(\chi^{M},\chi^{F}) \in n} \sum_{\psi} P(\chi^{M}, \chi^{F}, \psi, D, M, F, S) \\ &= \sum_{(\chi^{M},\chi^{F}) \in n} P(\chi^{M})\, P(\chi^{F}) \sum_{\psi} P(\psi) \int f(p_{d}) \int f(p_{a}) \prod_{k} P(\hat{g}_{k}, g_{k}^{M}, g_{k}^{F}, \hat{g}_{k}^{S} \mid \chi_{k}^{M}, \chi_{k}^{F}, \psi_{k}, p_{d}, p_{a})\, dp_{d}\, dp_{a} \\ &= \sum_{(\chi^{M},\chi^{F}) \in n} P(\chi^{M})\, P(\chi^{F}) \sum_{\psi} P(\psi) \int f(p_{d}) \int f(p_{a}) \times \prod_{k} \sum_{(f_{1},f_{2}) \in g_{k}^{F}} P(f_{1})\, P(f_{2})\, P(\hat{g}_{k}^{S} \mid \psi_{k}(f_{1},f_{2}), p_{d}, p_{a}) \\ &\qquad \times \sum_{(m_{1},m_{2}) \in g_{k}^{M}} P(m_{1})\, P(m_{2})\, P(\hat{g}_{k} \mid \chi_{k}^{M}(m_{1},m_{2}), \chi_{k}^{F}(f_{1},f_{2}), p_{d}, p_{a})\, dp_{d}\, dp_{a} \end{aligned}$

How to calculate each of the probabilities appearing in the last equation above has been described elsewhere in this disclosure. Although multiple sperm can be added in order to increase reliability of the copy number call, in practice one sperm is typically sufficient. This solution is computationally tractable for a small number of sperm.

H Simplified Method Using Only Polar Homozygotes

In another embodiment of the invention, a similar method to determine the number of copies of a chromosome can be implemented using a limited subset of SNPs in a simplified approach. The method is purely qualitative, uses parental data, and focuses exclusively on a subset of SNPs, the so-called polar homozygotes (described below). Polar homozygotic denotes the situation in which the mother and father are both homozygous at a SNP, but the homozygotes are opposite, or different allele values. Thus, the mother could be AA and the father BB, or vice versa. Since the actual allele values are not important (only their relationship to each other, i.e., that they are opposites, matters), the mother's alleles will be referred to as MM, and the father's as FF. In such a situation, if the embryo is euploid, it must be heterozygous at that locus. However, due to allele dropouts, a heterozygous SNP in the embryo may not be called as heterozygous. In fact, given the high rate of dropout associated with single cell amplification, it is far more likely to be called as either MM or FF, each with equal probability.

In this method, the focus is solely on those loci on a particular chromosome that are polar homozygotes and for which the embryo, which is therefore known to be heterozygous, is nonetheless called homozygous. It is possible to form the statistic |MM|/(|MM|+|FF|), where |MM| is the number of these SNPs that are called MM in the embryo and |FF| is the number of these SNPs that are called FF in the embryo.

Under the hypothesis of euploidy, |MM|/(|MM|+|FF|) is Gaussian in nature, with mean ½ and variance 1/(4N), where N=(|MM|+|FF|). Therefore the statistic is completely independent of the dropout rate, or, indeed, of any other factors. Due to the symmetry of the construction, the distribution of this statistic under the hypothesis of euploidy is known.

Under the hypothesis of trisomy, the statistic will not have a mean of ½. If, for example, the embryo has MMF trisomy, then the homozygous calls in the embryo will lean toward MM and away from FF, and vice versa. Note that because only loci where the parents are homozygous are under consideration, there is no need to distinguish M1 and M2 copy errors. In all cases, if the mother contributes 2 chromosomes instead of 1, they will be MM regardless of the underlying cause, and similarly for the father. The exact mean under trisomy will depend upon the dropout rate, p, but in no case will the mean be greater than ⅓, which is the limit of the mean as p goes to 1. Under monosomy, the mean would be precisely 0, except for noise induced by allele dropins.

In this embodiment, it is not necessary to model the distribution under aneuploidy, but only to reject the null hypothesis of euploidy, whose distribution is completely known. Any embryo for which the null hypothesis cannot be rejected at a predetermined significance level would be deemed normal.
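A minimal Python sketch of this hypothesis test follows. It uses the Gaussian approximation stated above (mean ½, variance 1/(4N)) and a two-sided test. The function name and the choice of a two-sided p-value are assumptions made for illustration, and the significance level is left to the caller.

```python
import math

def polar_homozygote_test(n_mm: int, n_ff: int):
    """Test the euploidy null hypothesis from polar-homozygote call counts.

    n_mm, n_ff -- number of polar-homozygote SNPs called MM and FF in the embryo
    Returns (statistic, z_score, two_sided_p_value).
    """
    n = n_mm + n_ff
    if n == 0:
        raise ValueError("at least one homozygous call is required")
    statistic = n_mm / n
    # Under euploidy the statistic has mean 1/2 and variance 1/(4N).
    z = (statistic - 0.5) / math.sqrt(1.0 / (4.0 * n))
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return statistic, z, p_value
```

For instance, 80 MM calls against 20 FF calls give a statistic of 0.8 and a z-score of about 6, so the euploidy hypothesis would be rejected at any conventional significance level, while a 50/50 split gives a p-value of 1.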

In another embodiment of the invention, of the homozygotic loci, those that result in no-call (NC) on the embryo contain information, and can be included in the calculations, yielding more loci for consideration. In another embodiment, those loci that are not homozygotic, but rather follow the pattern AA|AB, can also be included in the calculations, yielding more loci for consideration. It should be obvious to one skilled in the art how to modify the method to include these additional loci into the calculation.

I Reduction to Practice of the PS Method as Applied to Allele Calling

In order to demonstrate a reduction to practice of the PS method as applied to cleaning the genetic data of a target individual, and its associated allele-call confidences, extensive Monte-Carlo simulations were run. The PS method's confidence numbers match the observed rate of correct calls in simulation. The details of these simulations are given in separate documents whose benefits are claimed by this disclosure. In addition, this aspect of the PS method has been reduced to practice on real triad data (a mother, a father and a born child). Results are shown below in Table 8. The TAQMAN assay was used to measure single cell genotype data consisting of diploid measurements of a large buccal sample from the father (columns p₁,p₂), diploid measurements of a buccal sample from the mother (m₁,m₂), haploid measurements on three isolated sperm from the father (h₁,h₂,h₃), and diploid measurements of four single cells from a buccal sample from the born child of the triad. Note that all diploid data are unordered. All SNPs are from chromosome 7 and within 2 megabases of the CFTR gene, in which a defect causes cystic fibrosis.

The goal was to estimate (in E1,E2) the alleles of the child, by running PS on the measured data from a single child buccal cell (e₁,e₂), which served as a proxy for a cell from the embryo of interest. Since no maternal haplotype sequence was available, the three additional single cells of the child sample, (b11,b12), (b21,b22), (b22,b23), were used in the same way that additional blastomeres from other embryos are used to infer maternal haplotype once the paternal haplotype is determined from sperm. The true allele values (T1,T2) on the child are determined by taking three buccal samples of several thousand cells, genotyping them independently, and only choosing SNPs on which the results were concordant across all three samples. This process yielded 94 concordant SNPs. Those loci that had a valid genotype call, according to the ABI 7900 reader, on the child cell that represented the embryo, were then selected. For each of these 69 SNPs, the disclosed method determined de-noised allele calls on the embryo (E₁,E₂), as well as the confidence associated with each genotype call.

Twenty-nine percent (29%) of the 69 raw allele calls in uncleaned genetic data from the child cell were incorrect (marked with a dash “-” in column e1 and e2, Table 8). Columns (E₁,E₂) show that PS corrected 18 of these (as indicated by a box in column E1 and E2, but not in column ‘conf’, Table 8), while two remained miscalled (2.9% error rate; marked with a dash “-” in column ‘conf’, Table 8). Note that the two SNPs that were miscalled had low confidences of 53.8% and 74.4%. These low confidences indicate that the calls might be incorrect, due either to a lack of data or to inconsistent measurements on multiple sperm or “blastomeres.” The confidence in the genotype calls produced is an integral part of the PS report. Note that this demonstration, which sought to call the genotype of 69 SNPs on a chromosome, was more difficult than that encountered in practice, where the genotype at only one or two loci will typically be of interest, based on initial screening of parents' data. In some embodiments, the disclosed method may achieve a higher level of accuracy at loci of interest by: i) continuing to measure single sperm until multiple haploid allele calls have been made at the locus of interest; ii) including additional blastomere measurements; iii) incorporating maternal haploid data from extruded polar bodies, which are commonly biopsied in pre-implantation genetic diagnosis today. It should be obvious to one skilled in the art that there exist other modifications to the method that can also increase the level of accuracy, as well as how to implement these, without changing the essential concept of the disclosure.

J Reduction to Practice of the PS Method as Applied to Calling Aneuploidy

To demonstrate the reduction to practice of certain aspects of the invention disclosed herein, the method was used to call aneuploidy on several sets of single cells. In this case, only selected data from the genotyping platform was used: the genotype information from parents and embryo. A simple genotyping algorithm, called “pie slice”, was used, and it showed itself to be about 99.9% accurate on genomic data. It is less accurate on MDA data, due to the noise inherent in MDA. It is more accurate when there is a fairly high “dropout” rate in MDA. It also depends, crucially, on being able to model the probabilities of various genotyping errors in terms of parameters known as dropout rate and dropin rate.

The unknown chromosome copy numbers are inferred because different copy numbers interact differently with the dropout rate, dropin rate, and the genotyping algorithm. By creating a statistical model that specifies how the dropout rate, dropin rate, chromosome copy numbers, and genotype cutoff-threshold all interact, it is possible to use standard statistical inference methods to tease out the unknown chromosome copy numbers.

The method of aneuploidy detection described here is termed qualitative CNC, or qCNC for short, and employs the basic statistical inferencing methods of maximum-likelihood estimation, maximum-a-posteriori estimation, and Bayesian inference. The methods are very similar, with slight differences. The methods described here are similar to those described previously, and are summarized here for the sake of convenience.

Maximum Likelihood (ML)

Let X₁, . . . , X_(n)˜f(x;θ). Here the X_(i) are independent, identically distributed random variables, drawn according to a probability distribution that belongs to a family of distributions parameterized by the vector θ. For example, the family of distributions might be the family of all Gaussian distributions, in which case θ=(μ, σ²) would be the mean and variance that determine the specific distribution in question. The problem is as follows: θ is unknown, and the goal is to get a good estimate of it based solely on the observations of the data X₁, . . . , X_(n). The maximum likelihood solution is given by

$\hat{\theta} = {\underset{\theta}{argmax}\underset{i}{\Pi}{f\left( {X_{i};\theta} \right)}}$

Maximum A Posteriori (MAP) Estimation

Posit a prior distribution f(θ) that determines the prior probability of actually seeing θ as the parameter, allowing us to write X₁, . . . , X_(n)˜f(x|θ). The MAP solution is given by

$\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\; f(\theta) \prod_{i} f(X_{i} \mid \theta)$

Note that the ML solution is equivalent to the MAP solution with auniform (possibly improper) prior.
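The relationship between the two estimators can be illustrated on a single rate parameter, such as a dropout rate estimated from k dropouts observed in n independent trials. The Beta(a, b) prior used below is an assumption chosen for illustration because it gives a closed-form MAP estimate; the disclosure does not specify the form of the prior. The function names are hypothetical.

```python
def ml_rate(k: int, n: int) -> float:
    """Maximum likelihood estimate of a rate from k events in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    return k / n

def map_rate(k: int, n: int, a: float, b: float) -> float:
    """MAP estimate of the same rate under a Beta(a, b) prior.

    The posterior is Beta(k + a, n - k + b); its mode is returned.
    The mode formula is valid when k + a > 1 and n - k + b > 1.
    """
    if k + a <= 1 or n - k + b <= 1:
        raise ValueError("posterior mode is not interior for these values")
    return (k + a - 1) / (n + a + b - 2)
```

With the uniform prior Beta(1, 1), `map_rate` reduces to `ml_rate`, which is the equivalence noted above. A more informative prior such as Beta(2, 2) pulls the estimate toward ½.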

Bayesian Inference

Bayesian inference comes into play when θ=(θ₁, . . . , θ_(d)) is multidimensional but it is only necessary to estimate a subset (typically one) of the parameters θ_(j). In this case, if there is a prior on the parameters, it is possible to integrate out the other parameters that are not of interest. Without loss of generality, suppose that θ₁ is the parameter for which an estimate is desired. Then the Bayesian solution is given by:

$\hat{\theta}_{1} = \underset{\theta_{1}}{\operatorname{argmax}}\; f(\theta_{1}) \int f(\theta_{2}) \ldots f(\theta_{d}) \prod_{i} f(X_{i} \mid \theta)\, d\theta_{2} \ldots d\theta_{d}$
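Integrating out a nuisance parameter can be sketched numerically by replacing the integral with a sum over a grid. The example below is a generic illustration, not the copy number model itself: it assumes Gaussian data with the mean as the parameter of interest and the standard deviation as the nuisance parameter, with uniform priors over both grids. All names are hypothetical.

```python
import math

def gaussian_log_pdf(x: float, mu: float, sigma: float) -> float:
    """Log density of a normal distribution with mean mu and std sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def marginal_posterior_mu(data, mu_grid, sigma_grid):
    """Posterior over mu on mu_grid after summing out sigma over sigma_grid.

    Uniform priors on both grids are assumed, so prior terms cancel
    when the result is normalized.
    """
    weights = []
    for mu in mu_grid:
        total = 0.0
        for sigma in sigma_grid:
            log_lik = sum(gaussian_log_pdf(x, mu, sigma) for x in data)
            total += math.exp(log_lik)  # accumulate likelihood over the nuisance grid
        weights.append(total)
    norm = sum(weights)
    return [w / norm for w in weights]
```

The estimate of the parameter of interest is then the grid point with the largest marginal posterior weight. For long data sets the likelihoods should be combined in log space to avoid underflow, which this short sketch does not attempt.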

Copy Number Classification

Any one or some combination of the above methods may be used to determine the copy number count, as well as when making allele calls such as in the cleaning of embryonic genetic data. In one embodiment, the data may come from INFINIUM platform measurements {(x_(jk),y_(jk))}, where x_(jk) is the platform response on channel X to SNP k of chromosome j, and y_(jk) is the platform response on channel Y to SNP k of chromosome j. The key to the usefulness of this method lies in choosing the family of distributions from which it is postulated that these data are drawn. In one embodiment, that distribution is parameterized by many parameters. These parameters are responsible for describing things such as probe efficiency, platform noise, MDA characteristics such as dropout, dropin, and overall amplification mean, and, finally, the genetic parameters: the genotypes of the parents, the true but unknown genotype of the embryo, and, of course, the parameters of interest: the chromosome copy numbers supplied by the mother and father to the embryo.

In one embodiment, a good deal of information is discarded before data processing. The advantage of doing this is that it is possible to model the data that remains in a more robust manner. Instead of using the raw platform data {(x_(jk),y_(jk))}, it is possible to pre-process the data by running the genotyping algorithm on the data. This results in a set of genotype calls (g_(jk)), where g_(jk)∈{NC,AA,AB,BB}. NC stands for “no-call”. Putting these together into the Bayesian inference paradigm above yields:

$\hat{n}_{j}^{M}, \hat{n}_{j}^{F} = \underset{n^{M},\, n^{F}}{\arg\max} \int\!\!\int f\left( p_{d} \right) f\left( p_{a} \right) \prod_{k} P\left( g_{jk} \mid n^{M}, n^{F}, M_{j}, F_{j}, p_{d}, p_{a} \right)\, dp_{d}\, dp_{a}$

Explanation of the Notation:

n̂_(j)^(M), n̂_(j)^(F) are the estimated numbers of chromosome copies supplied to the embryo by the mother and father, respectively. These should sum to 2 for the autosomes in the case of euploidy, i.e., each parent should supply exactly 1 chromosome.

p_(d) and p_(a) are the dropout and dropin rates for genotyping, respectively. These reflect some of the modeling assumptions. It is known that in single-cell amplification, some SNPs “drop out”, which is to say that they are not amplified and, as a consequence, do not show up when the SNP genotyping is attempted on the INFINIUM platform. This phenomenon is modeled by saying that each allele at each SNP “drops out” independently with probability p_(d) during the MDA phase. Similarly, the platform is not a perfect measurement instrument. Due to measurement noise, the platform sometimes picks up a ghost signal, which can be modeled as a dropin that acts independently at each SNP with probability p_(a).

M_(j),F_(j) are the true genotypes of the mother and father, respectively. The true genotypes are not known perfectly, but because large samples from the parents are genotyped, one may make the assumption that the truth on the parents is essentially known.
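
As a minimal sketch of how the dropout assumption shapes the distribution of genotype calls, the following enumerates which alleles survive amplification when each copy drops out independently with probability p_(d). It deliberately ignores dropin (p_(a)) and platform noise, and the function name `dropout_transition_row` is hypothetical; it is an illustration of the modeling idea, not the likelihood used in any particular embodiment.

```python
from itertools import product

def dropout_transition_row(true_alleles, p_d):
    """Distribution over called genotypes for one SNP.

    true_alleles: string of the allele copies actually present, e.g. "AB"
                  for a disomic heterozygote or "AAB" for a trisomy.
    p_d:          probability that any single allele copy drops out.
    Assumption: dropin and measurement noise are ignored.
    """
    row = {"NC": 0.0, "AA": 0.0, "AB": 0.0, "BB": 0.0}
    # Enumerate every keep/drop pattern over the allele copies.
    for kept in product([True, False], repeat=len(true_alleles)):
        prob = 1.0
        for k in kept:
            prob *= (1.0 - p_d) if k else p_d
        surviving = {a for a, k in zip(true_alleles, kept) if k}
        if not surviving:
            call = "NC"      # nothing amplified
        elif surviving == {"A"}:
            call = "AA"      # only A alleles observed
        elif surviving == {"B"}:
            call = "BB"      # only B alleles observed
        else:
            call = "AB"      # both alleles observed
        row[call] += prob
    return row

disomy_row = dropout_transition_row("AB", 0.1)
trisomy_row = dropout_transition_row("AAB", 0.1)
```

Under these assumptions a third copy of the A allele makes an AB call more likely and a BB call less likely than in the disomic case, which is the kind of difference a copy number likelihood can exploit across many SNPs.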

Probe Modeling

In one embodiment of the invention, platform response models, or error models, that vary from one probe to another can be used without changing the essential nature of the invention. The amplification efficiency and error rates caused by allele dropouts, allele dropins, or other factors, may vary between different probes. In one embodiment, an error transition matrix can be made that is particular to a given probe. Platform response models, or error models, can be relevant to a particular probe or can be parameterized according to the quantitative measurements that are performed, so that the response model or error model is therefore specific to that particular probe and measurement.

Genotyping

Genotyping also requires an algorithm with some built-in assumptions. Going from a platform response (x,y) to a genotype g requires significant calculation. It essentially requires that the positive quadrant of the x/y plane be divided into those regions where AA, AB, BB, and NC will be called. Furthermore, in the most general case, it may be useful to have regions where AAA, AAB, etc., could be called for trisomies.

In one embodiment, use is made of a particular genotyping algorithm called the pie-slice algorithm, because it divides the positive quadrant of the x/y plane into three triangles, or “pie slices”. Those (x,y) points that fall in the pie slice that hugs the X axis are called AA, those that fall in the slice that hugs the Y axis are called BB, and those in the middle slice are called AB. In addition, a small square is superimposed whose lower-left corner touches the origin. (x,y) points falling in this square are designated NC, because both x and y components have small values and hence are unreliable.

The width of that small square is called the no-call threshold and it is a parameter of the genotyping algorithm. In order for the dropin/dropout model to correctly model the error transition matrix associated with the genotyping algorithm, the no-call threshold must be tuned properly. The error transition matrix indicates, for each true-genotype/called-genotype pair, the probability of seeing the called genotype given the true genotype. This matrix depends on the dropout rate of the MDA and upon the no-call threshold set for the genotyping algorithm.
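
The pie-slice rule and the no-call square can be sketched as follows. The slice boundaries of 30 and 60 degrees are assumptions made for illustration only (the text does not specify the angles), and the function name `pie_slice_call` is hypothetical.

```python
import math

def pie_slice_call(x, y, no_call_threshold, low_deg=30.0, high_deg=60.0):
    """Call a genotype from platform responses (x, y) in the positive quadrant.

    no_call_threshold: width of the square at the origin inside which the
                       responses are too small to be trusted.
    low_deg, high_deg: assumed angular boundaries of the middle (AB) slice.
    """
    if x < 0 or y < 0:
        raise ValueError("responses must lie in the positive quadrant")
    # Both channels weak: the point falls in the no-call square.
    if x < no_call_threshold and y < no_call_threshold:
        return "NC"
    angle = math.degrees(math.atan2(y, x))
    if angle < low_deg:
        return "AA"   # slice hugging the X axis
    if angle > high_deg:
        return "BB"   # slice hugging the Y axis
    return "AB"       # middle slice

calls = [pie_slice_call(x, y, no_call_threshold=0.2)
         for x, y in [(1.0, 0.05), (0.05, 1.0), (1.0, 1.0), (0.1, 0.1)]]
```

Raising `no_call_threshold` trades more NC calls for fewer miscalls on weak signals, which is why the threshold interacts with the dropout rate in the error transition matrix.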

Note that a wide variety of different allele calling, or genotyping, algorithms may be used without changing the fundamental concept of the invention. For example, the no-call region could be defined by many different shapes besides a square, such as a quarter circle, and the no-call thresholds may vary greatly for different genotyping algorithms.

Results of Aneuploidy Calling Experiments

Presented here are experiments that demonstrate the reduction to practice of the method disclosed herein to correctly call ploidy of single cells. The goal of this demonstration was twofold: first, to show that the disclosed method correctly calls the cell's ploidy state with high confidence using samples with known chromosome copy numbers, both euploid and aneuploid, as controls, and second, to show that the method disclosed herein calls the cell's ploidy state with high confidence using blastomeres with unknown chromosome copy numbers.

In order to increase confidences, the ILLUMINA INFINIUM II platform, which allows measurement of hundreds of thousands of SNPs, was used. In order to run this experiment in the context of PGD, the standard INFINIUM II protocol was reduced from three days to 20 hours. Single cell measurements were compared between the full and accelerated INFINIUM II protocols, and showed 85% concordance. The accelerated protocol showed an increase in locus drop-out (LDO) rate from <1% to 5-10%; however, because hundreds of thousands of SNPs are measured and because PS accommodates allele dropouts, this increase in LDO rate does not have a significant negative impact on the results.

The entire aneuploidy calling method was performed on eight known-euploid buccal cells isolated from two healthy children from different families, ten known-trisomic cells isolated from a human immortalized trisomic cell line, and six blastomeres with an unknown number of chromosomes isolated from three embryos donated to research. Half of each set of cells was analyzed by the accelerated 20-hour protocol, and the other half by the standard protocol. Note that for the immortalized trisomic cells, no parent data was available. Consequently, for these cells, a pair of pseudo-parental genomes was generated by drawing their genotypes from the conditional distribution induced by observation of a large tissue sample of the trisomic genotype at each locus.

Where truth was known, the method correctly called the ploidy state of each chromosome in each cell with high confidence. The data are summarized below in three tables. Each table shows the chromosome number in the first column, and each pair of color-matched columns represents the analysis of one cell with the copy number call on the left and the confidence with which the call is made on the right. Each row corresponds to one particular chromosome. Note that these tables contain the ploidy information of the chromosomes in a format that could be used for the report that is provided to the doctor to help in the determination of which embryos are to be selected for transfer to the prospective mother. (Note ‘1’ may result from both monosomy and uniparental disomy.) Table 9 shows the results for eight known-euploid buccal cells; all were correctly found to be euploid with high confidences (>0.99). Table 10 shows the results for ten known-trisomic cells (trisomic at chromosome 21); all were correctly found to be trisomic at chromosome 21 and disomic at all other chromosomes with high confidences (>0.92). Table 11 shows the results for six blastomeres isolated from three different embryos. While no truth models exist for donated blastomeres, it is possible to look for concordance between blastomeres originating from a single embryo; however, the frequency and characteristics of mosaicism in human embryos are not currently known, and thus the presence or lack of concordance between blastomeres from a common embryo is not necessarily indicative of correct ploidy determination. The first three blastomeres are from one embryo (e1) and of those, the first two (e1b1 and e1b3) have the same ploidy state at all chromosomes except one. The third cell (e1b6) is complex aneuploid. Both blastomeres from the second embryo were found to be monosomic at all chromosomes. The blastomere from the third embryo was found to be complex aneuploid. Note that some confidences are below 90%; however, if the confidences of all aneuploid hypotheses are combined, all chromosomes are called either euploid or aneuploid with confidence exceeding 92.8%.

K Laboratory Techniques

There are many techniques available allowing the isolation of cells and DNA fragments for genotyping, as well as for the subsequent genotyping of the DNA. The system and method described here can be applied to any of these techniques, specifically those involving the isolation of fetal cells or DNA fragments from maternal blood, or blastomeres from embryos in the context of IVF. It can be equally applied to genomic data in silico, i.e. not directly measured from genetic material. In one embodiment of the system, this data can be acquired as described below. This description of techniques is not meant to be exhaustive, and it should be clear to one skilled in the art that there are other laboratory techniques that can achieve the same ends.

Isolation of Cells

Adult diploid cells can be obtained from bulk tissue or blood samples. Adult diploid single cells can be obtained from whole blood samples using FACS, or fluorescence activated cell sorting. Adult haploid single sperm cells can also be isolated from a sperm sample using FACS. Adult haploid single egg cells can be isolated in the context of egg harvesting during IVF procedures.

Isolation of the target single cell blastomeres from human embryos can be done using techniques common in in vitro fertilization clinics, such as embryo biopsy. Isolation of target fetal cells in maternal blood can be accomplished using monoclonal antibodies, or other techniques such as FACS or density gradient centrifugation.

DNA extraction also might entail non-standard methods for this application. Literature reports comparing various methods for DNA extraction have found that in some cases novel protocols, such as the addition of N-lauroylsarcosine, were more efficient and produced the fewest false positives.

Amplification of Genomic DNA

Amplification of the genome can be accomplished by multiple methods including: ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), and multiple displacement amplification (MDA). Of the three methods, DOP-PCR reliably produces large quantities of DNA from small quantities of DNA, including single copies of chromosomes; this method may be most appropriate for genotyping the parental diploid data, where data fidelity is critical. MDA is the fastest method, producing hundred-fold amplification of DNA in a few hours; this method may be most appropriate for genotyping embryonic cells, or in other situations where time is of the essence.

Background amplification is a problem for each of these methods, since each method would potentially amplify contaminating DNA. Very tiny quantities of contamination can irreversibly poison the assay and give false data. Therefore, it is critical to use clean laboratory conditions, wherein pre- and post-amplification workflows are completely, physically separated. Clean, contamination-free workflows for DNA amplification are now routine in industrial molecular biology, and simply require careful attention to detail.

Genotyping Assay and Hybridization

The genotyping of the amplified DNA can be done by many methods including MOLECULAR INVERSION PROBES (MIPs) such as AFFYMETRIX's GENFLEX TAG array, microarrays such as AFFYMETRIX's 500K array or the ILLUMINA BEAD ARRAYS, or SNP genotyping assays such as APPLIED BIOSYSTEMS' TAQMAN assay. The AFFYMETRIX 500K array, MIPs/GENFLEX, TAQMAN and ILLUMINA assay all require microgram quantities of DNA, so genotyping a single cell with any of these workflows would require some kind of amplification. Each of these techniques has various tradeoffs in terms of cost, quality of data, quantitative vs. qualitative data, customizability, time to complete the assay and the number of measurable SNPs, among others. An advantage of the 500K and ILLUMINA arrays is the large number of SNPs on which they can gather data, roughly 250,000, as opposed to MIPs, which can detect on the order of 10,000 SNPs, and the TAQMAN assay, which can detect even fewer. An advantage of the MIPs, TAQMAN and ILLUMINA assay over the 500K arrays is that they are inherently customizable, allowing the user to choose SNPs, whereas the 500K arrays do not permit such customization.

In the context of pre-implantation diagnosis during IVF, the inherent time limitations are significant; in this case it may be advantageous to sacrifice data quality for turn-around time. Although it has other clear advantages, the standard MIPs assay protocol is a relatively time-intensive process that typically takes 2.5 to three days to complete. In MIPs, annealing of probes to target DNA and post-amplification hybridization are particularly time-intensive, and any deviation from these times results in degradation in data quality. Probes anneal overnight (12-16 hours) to the DNA sample. Post-amplification hybridization anneals to the arrays overnight (12-16 hours). A number of other steps before and after both annealing and amplification bring the total standard timeline of the protocol to 2.5 days. Optimization of the MIPs assay for speed could potentially reduce the process to fewer than 36 hours. Both the 500K arrays and the ILLUMINA assays have a faster turnaround: approximately 1.5 to two days to generate highly reliable data in the standard protocol. Both of these methods are optimizable, and it is estimated that the turn-around time for the genotyping assay for the 500K array and/or the ILLUMINA assay could be reduced to less than 24 hours. Even faster is the TAQMAN assay, which can be run in three hours. For all of these methods, the reduction in assay time will result in a reduction in data quality; however, that is exactly what the disclosed invention is designed to address.

Naturally, in situations where the timing is critical, such as genotyping a blastomere during IVF, the faster assays have a clear advantage over the slower assays, whereas in cases that do not have such time pressure, such as when genotyping the parental DNA before IVF has been initiated, other factors will predominate in choosing the appropriate method. For example, another tradeoff that exists from one technique to another is one of price versus data quality. It may make sense to use more expensive techniques that give high quality data for measurements that are more important, and less expensive techniques that give lower quality data for measurements where the fidelity is not as critical. Any techniques which are developed to the point of allowing sufficiently rapid high-throughput genotyping could be used to genotype genetic material for use with this method.

Methods for Simultaneous Targeted Locus Amplification and Whole Genome Amplification

During whole genome amplification of small quantities of genetic material, whether through ligation-mediated PCR (LM-PCR), multiple displacement amplification (MDA), or other methods, dropouts of loci occur randomly and unavoidably. It is often desirable to amplify the whole genome nonspecifically, but to ensure that a particular locus is amplified with greater certainty. It is possible to perform simultaneous locus targeting and whole genome amplification.

In a preferred embodiment, the basis for this method is to combine standard targeted polymerase chain reaction (PCR) to amplify particular loci of interest with any generalized whole genome amplification method. This may include, but is not limited to: preamplification of particular loci before generalized amplification by MDA or LM-PCR, the addition of targeted PCR primers to universal primers in the generalized PCR step of LM-PCR, and the addition of targeted PCR primers to degenerate primers in MDA.

L Techniques for Screening for Aneuploidy using High and Medium Throughput Genotyping

In one embodiment of the system, the measured genetic data can be used to detect the presence of aneuploidies and/or mosaicism in an individual. Disclosed herein are several methods of using medium or high-throughput genotyping to detect the number of chromosomes or DNA segment copy number from amplified or unamplified DNA from tissue samples. The goal is to estimate the reliability that can be achieved in detecting certain types of aneuploidy and levels of mosaicism using different quantitative and/or qualitative genotyping platforms such as ABI Taqman, MIPS, or Microarrays from Illumina, Agilent and Affymetrix. In many of these cases, the genetic material is amplified by PCR before hybridization to probes on the genotyping array to detect the presence of particular alleles. How these assays are used for genotyping is described elsewhere in this disclosure.

Described below are several methods for screening for abnormal numbers of DNA segments, whether arising from deletions, aneuploidies and/or mosaicism. The methods are grouped as follows: (i) quantitative techniques without making allele calls; (ii) qualitative techniques that leverage allele calls; (iii) quantitative techniques that leverage allele calls; (iv) techniques that use a probability distribution function for the amplification of genetic data at each locus. All methods involve the measurement of multiple loci on a given segment of a given chromosome to determine the number of instances of the given segment in the genome of the target individual. In addition, the methods involve creating a set of one or more hypotheses about the number of instances of the given segment; measuring the amount of genetic data at multiple loci on the given segment; determining the relative probability of each of the hypotheses given the measurements of the target individual's genetic data; and using the relative probabilities associated with each hypothesis to determine the number of instances of the given segment. Furthermore, the methods all involve creating a combined measurement M that is a computed function of the measurements of the amounts of genetic data at multiple loci. In all the methods, thresholds are determined for the selection of each hypothesis H_(i) based on the measurement M, and the number of loci to be measured is estimated, in order to have a particular level of false detections of each of the hypotheses.

The probability of each hypothesis given the measurement M is P(H_(i)|M)=P(M|H_(i))P(H_(i))/P(M). Since P(M) is independent of H_(i), we can determine the relative probability of the hypothesis given M by considering only P(M|H_(i))P(H_(i)). In what follows, in order to simplify the analysis and the comparison of different techniques, we assume that P(H_(i)) is the same for all {H_(i)}, so that we can compute the relative probability of all the P(H_(i)|M) by considering only P(M|H_(i)). Consequently, our determination of thresholds and the number of loci to be measured is based on having particular probabilities of selecting false hypotheses under the assumption that P(H_(i)) is the same for all {H_(i)}. It will be clear to one skilled in the art after reading this disclosure how the approach would be modified to accommodate the fact that P(H_(i)) varies for different hypotheses in the set {H_(i)}. In some embodiments, the thresholds are set so that the hypothesis H_(i*) that maximizes P(H_(i)|M) over all i is selected. However, thresholds need not necessarily be set to maximize P(H_(i)|M), but rather to achieve a particular ratio of the probability of false detections between the different hypotheses in the set {H_(i)}.
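
The relative-probability calculation can be written out as a short sketch. The function names and the likelihood values are illustrative assumptions; the point is only that, with equal priors, ranking hypotheses by P(M|H_(i)) is the same as ranking them by P(H_(i)|M), whereas unequal priors can change the selected hypothesis.

```python
def hypothesis_posteriors(likelihoods, priors=None):
    """Return P(H_i | M) from P(M | H_i) and P(H_i) using Bayes' rule.

    If priors is None, all hypotheses are assumed equally likely a priori.
    """
    n = len(likelihoods)
    if priors is None:
        priors = [1.0 / n] * n
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(M), the same for every hypothesis
    return [j / evidence for j in joint]

def select_hypothesis(likelihoods, priors=None):
    """Index i* of the hypothesis with the largest posterior probability."""
    post = hypothesis_posteriors(likelihoods, priors)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical likelihoods P(M | H_i) for one, two and three segment copies.
likelihoods = [0.02, 0.06, 0.02]
equal_prior_posteriors = hypothesis_posteriors(likelihoods)
equal_prior_choice = select_hypothesis(likelihoods)
skewed_prior_choice = select_hypothesis(likelihoods, priors=[0.9, 0.05, 0.05])
```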

It is important to note that the techniques referred to herein for detecting aneuploidies can be equally well used to detect uniparental disomy and unbalanced translocations, and for sex determination (male or female; XY or XX). All of the concepts concern detecting the identity and number of chromosomes (or segments of chromosomes) present in a given sample, and thus are all addressed by the methods described in this document. It should be obvious to one skilled in the art how to extend any of the methods described herein to detect any of these abnormalities.

The Concept of Matched Filtering

The methods applied here are similar to those applied in optimal detection of digital signals. It can be shown using the Schwarz inequality that the optimal approach to maximizing Signal to Noise Ratio (SNR) in the presence of normally distributed noise is to build an idealized matching signal, or matched filter, corresponding to each of the possible noise-free signals, and to correlate this matched signal with the received noisy signal. This approach requires that the set of possible signals be known, as well as the statistical distribution (mean and Standard Deviation (SD)) of the noise. Herein is described the general approach to detecting whether chromosomes, or segments of DNA, are present or absent in a sample. No differentiation will be made between looking for whole chromosomes or looking for chromosome segments that have been inserted or deleted. Both will be referred to as DNA segments. It should be clear after reading this description how the techniques may be extended to many scenarios of aneuploidy and sex determination, or detecting insertions and deletions in the chromosomes of embryos, fetuses or born children. This approach can be applied to a wide range of quantitative and qualitative genotyping platforms including Taqman, qPCR, Illumina Arrays, Affymetrix Arrays, Agilent Arrays, the MIPS kit, etc.

Formulation of the General Problem

Assume that there are probes at SNPs where two allelic variations occur, x and y. At each locus i, i=1 . . . N, data is collected corresponding to the amount of genetic material from the two alleles. In the Taqman assay, these measures would be, for example, the cycle time, C_(t), at which the level of each allele-specific dye crosses a threshold. It will be clear how this approach can be extended to different measurements of the amount of genetic material at each locus or corresponding to each allele at a locus. Quantitative measurements of the amount of genetic material may be nonlinear, in which case the change in the measurement of a particular locus caused by the presence of the segment of interest will depend on how many other copies of that locus exist in the sample from other DNA segments. In some cases, a technique may require linear measurements, such that the change in the measurement of a particular locus caused by the presence of the segment of interest will not depend on how many other copies of that locus exist in the sample from other DNA segments. An approach is described for how the measurements from the Taqman or qPCR assays may be linearized, but there are many other techniques for linearizing nonlinear measurements that may be applied for different assays.

The measurements of the amount of genetic material of allele x at loci 1 . . . N are given by data d_(x)=[d_(x1) . . . d_(xN)]. Similarly for allele y, d_(y)=[d_(y1) . . . d_(yN)]. Assume that each segment j has alleles a_(j)=[a_(j1) . . . a_(jN)] where each element a_(ji) is either x or y. Describe the measurement data of the amount of genetic material of allele x as d_(x)=s_(x)+υ_(x) where s_(x) is the signal and υ_(x) is a disturbance. The signal s_(x)=[f_(x)(a₁₁, . . . , a_(J1)) . . . f_(x)(a_(1N), . . . , a_(JN))] where f_(x) is the mapping from the set of alleles to the measurement, and J is the number of DNA segment copies. The disturbance vector υ_(x) is caused by measurement error and, in the case of nonlinear measurements, the presence of other genetic material besides the DNA segment of interest. Assume that measurement errors are normally distributed and that they are large relative to disturbances caused by nonlinearity (see section on linearizing measurements) so that υ_(xi)≈n_(xi) where n_(xi) has variance σ_(xi)² and vector n_(x) is normally distributed ˜N(0,R), R=E(n_(x)n_(x)^(T)). Now, assume some filter h is applied to this data to perform the measurement m_(x)=h^(T)d_(x)=h^(T)s_(x)+h^(T)υ_(x). In order to maximize the ratio of signal to noise (h^(T)s_(x)/h^(T)n_(x)) it can be shown that h is given by the matched filter h=μR⁻¹s_(x) where μ is a scaling constant. The discussion for allele x can be repeated for allele y.
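
A small numerical sketch may help make the matched filter concrete. It assumes independent noise across loci, so that R is diagonal with entries σ_(xi)² and h_(i)=μs_(xi)/σ_(xi)²; it also measures signal to noise as the squared filtered signal divided by the variance of the filtered noise. The function names and the numbers are hypothetical.

```python
def matched_filter(signal, variances, mu=1.0):
    """h = mu * R^-1 * s for a diagonal noise covariance R (assumption)."""
    return [mu * s / v for s, v in zip(signal, variances)]

def output_snr(h, signal, variances):
    """(h^T s)^2 / (h^T R h): squared filtered signal over filtered noise variance."""
    filtered_signal = sum(hi * si for hi, si in zip(h, signal))
    noise_variance = sum(hi * hi * vi for hi, vi in zip(h, variances))
    return filtered_signal ** 2 / noise_variance

# Hypothetical expected signal and noise variance at three loci.
signal = [1.0, 2.0, 0.5]
variances = [0.5, 4.0, 0.1]

h_matched = matched_filter(signal, variances)
h_uniform = [1.0, 1.0, 1.0]  # plain averaging, for comparison

snr_matched = output_snr(h_matched, signal, variances)
snr_uniform = output_snr(h_uniform, signal, variances)
```

Under these assumptions the matched filter weights quiet, informative loci more heavily, and its output SNR equals the sum of s_(xi)²/σ_(xi)² over the loci, which plain averaging cannot exceed. The scaling constant μ cancels in the ratio.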

Method 1a: Measuring Aneuploidy or Sex by Quantitative Techniques that do not Make Allele Calls when the Mean and Standard Deviation for Each Locus is Known

Assume for this section that the data relates to the amount of genetic material at a locus irrespective of allele value (e.g. using qPCR), or the data is only for alleles that have 100% penetrance in the population, or that data is combined on multiple alleles at each locus (see section on linearizing measurements) to measure the amount of genetic material at that locus. Consequently, in this section one may refer to data d_(x) and ignore d_(y). Assume also that there are two hypotheses: h₀ that there are two copies of the DNA segment (these are typically not identical copies), and h₁ that there is only 1 copy. For each hypothesis, the data may be described as d_(xi)(h₀)=s_(xi)(h₀)+n_(xi) and d_(xi)(h₁)=s_(xi)(h₁)+n_(xi) respectively, where s_(xi)(h₀) is the expected measurement of the genetic material at locus i (the expected signal) when two DNA segments are present and s_(xi)(h₁) is the expected data for one segment. Construct the measurement for each locus by differencing out the expected signal for hypothesis h₀: m_(xi)=d_(xi)−s_(xi)(h₀). If h₁ is true, then the expected value of the measurement is E(m_(xi))=s_(xi)(h₁)−s_(xi)(h₀). Using the matched filter concept discussed above, set h=(1/N)R⁻¹(s_(x)(h₁)−s_(x)(h₀)). Taking R to be diagonal with entries σ_(xi)², the combined measurement is described as m=h^(T)m_(x)=(1/N)Σ_(i=1 . . . N)((s_(xi)(h₁)−s_(xi)(h₀))/σ_(xi)²)m_(xi).

If h₁ is true, the expected value of m is E(m|h₁)=m₁=(1/N)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)² and the standard deviation σ_(m|h1) of m is given by σ_(m|h1)²=(1/N²)Σ_(i=1 . . . N)((s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)⁴)σ_(xi)²=(1/N²)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)².

If h₀ is true, the expected value of m is E(m|h₀)=m₀=0 and the standard deviation σ_(m|h0) of m is again given by σ_(m|h0)²=(1/N²)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)².

FIG. 1 illustrates how to determine the probability of false negative and false positive detections. Assume that a threshold t is set half-way between m₁ and m₀ in order to make the probability of false negatives and false positives equal (this need not be the case as is described below). The probability of a false negative is determined by the ratio (m₁−t)/σ_(m|h1)=(m₁−m₀)/(2σ_(m|h1)). “5-Sigma” statistics may be used so that the probability of false negatives is 1−normcdf(5,0,1)=2.87e-7. In this case, the goal is for (m₁−m₀)/(2σ_(m|h0))>5, or 10·sqrt((1/N²)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)²)<(1/N)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)², or sqrt(Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)²)>10. In order to compute the size of N, the Mean Signal to Noise Ratio can be computed from aggregated data: MSNR=(1/N)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)². N can then be found from the inequality above: sqrt(N)·sqrt(MSNR)>10 or N>100/MSNR.
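
The quantities in Method 1a can be computed with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the per-locus expected signals and standard deviations are assumed to be known, and the locus count generalizes the "5-Sigma" inequality sqrt(N·MSNR)>10 to an arbitrary number of standard deviations z, i.e. sqrt(N·MSNR)≥2z.

```python
import math
from statistics import NormalDist

def mean_snr(s0, s1, sigma):
    """MSNR = (1/N) * sum_i (s1_i - s0_i)^2 / sigma_i^2."""
    n = len(s0)
    return sum((b - a) ** 2 / sd ** 2 for a, b, sd in zip(s0, s1, sigma)) / n

def combined_measurement(d, s0, s1, sigma):
    """m = (1/N) * sum_i ((s1_i - s0_i) / sigma_i^2) * (d_i - s0_i)."""
    n = len(d)
    return sum((b - a) / sd ** 2 * (x - a)
               for x, a, b, sd in zip(d, s0, s1, sigma)) / n

def required_loci(msnr, z=5.0):
    """Smallest N with sqrt(N * MSNR) >= 2z, for a threshold half-way between m0 and m1."""
    return math.ceil((2.0 * z) ** 2 / msnr)

# Hypothetical expected signals under h0 (two copies) and h1 (one copy).
s0 = [30.0, 30.0]
s1 = [31.5, 31.5]
sigma = [0.7, 0.7]

msnr = mean_snr(s0, s1, sigma)
m_if_h1 = combined_measurement(s1, s0, s1, sigma)  # noise-free data under h1
m_if_h0 = combined_measurement(s0, s0, s1, sigma)  # noise-free data under h0
threshold = 0.5 * (m_if_h0 + m_if_h1)

# Number of standard deviations for a one-sided error probability of 1e-4.
z_9999 = NormalDist().inv_cdf(0.9999)
loci_for_9999 = required_loci(9.7, z_9999)
```

With noise-free data the combined measurement equals 0 under h₀ and MSNR under h₁, so the half-way threshold is MSNR/2. Choosing z from a target error probability, rather than fixing it at 5, trades the number of loci against confidence.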

This approach was applied to data measured with the Taqman Assay from Applied BioSystems using 48 SNPs on the X chromosome. The measurement for each locus is the time, C_(t), that it takes the dye released in the well corresponding to this locus to exceed a threshold. Sample 0 consisted of roughly 0.3 ng (50 cells) of total DNA per well of mixed female origin where subjects had two X chromosomes; sample 1 consisted of roughly 0.3 ng of DNA per well of mixed male origin where subjects had one X chromosome. FIGS. 2 and 3 show the histograms of measurements for samples 1 and 0. The distributions for these samples are characterized by m₀=29.97, SD₀=1.32, m₁=31.44, SD₁=1.592. Since this data is derived from mixed male and female samples, some of the observed SD is due to the different allele frequencies at each SNP in the mixed samples. In addition, some of the observed SD will be due to the varying efficiency of the different assays at each SNP, and the differing amount of dye pipetted into each well. FIG. 4 provides a histogram of the difference in the measurements at each locus for the male and female sample. The mean difference between the male and female samples is 1.47 and the SD of the difference is 0.99. While this SD will still be subject to the different allele frequencies in the mixed male and female samples, it will no longer be affected by the different efficiencies of each assay at each locus. Since the goal is to differentiate two measurements each with a roughly similar SD, the adjusted SD may be approximated for each measurement for all loci as 0.99/sqrt(2)=0.70. Two runs were conducted for every locus in order to estimate σ_(xi) for the assay at that locus so that a matched filter could be applied. A lower limit of σ_(xi) was set at 0.2 in order to avoid statistical anomalies resulting from only two runs to compute σ_(xi). Only those loci (numbering 37) for which there were no allele dropouts over both alleles, over both experiment runs and over both male and female samples were used in the plots and calculations. Applying the approach above to this data, it was found that MSNR=2.26, hence N=2²5²/2.26²=17 loci. Although applied here only to the X chromosome, and to differentiating 1 copy from 2 copies, this experiment indicates the number of loci necessary to detect M2 copy errors for all chromosomes, where two exact copies of a chromosome occur in a trisomy, using Method 3 described below.

The measurement used for each locus is the cycle number, C_(t), that it takes the dye released in the well corresponding to a particular allele at the given locus to exceed a threshold that is automatically set by the ABI 7900HT reader based on the noise of the no-template control. Sample 0 consisted of roughly 60 pg (equivalent to the genome of 10 cells) of total DNA per well from a female blood sample (XX); sample 1 consisted of roughly 60 pg of DNA per well from a male blood sample (XY). As expected, the C_(t) measurement of female samples is on average lower than that of male samples.

There are several approaches to comparing the Taqman Assay measurements quantitatively between female and male samples. Here illustrated is one approach. To combine information from the FAM and VIC channel for each locus, C_(t) values of the two channels were converted to the copy numbers of their respective alleles, summed, and then converted back to a composite C_(t) value for that locus. The conversion between C_(t) value and the copy number was based on the equation Nc=10^((−a*Ct+b)) which is typically used to model the exponential growth of the dye measurement during real-time PCR. The coefficients a and b were determined empirically from the C_(t) values using multiple measurements on quantities of 6 pg and 60 pg of DNA. We determined that a≈0.298, b≈10.493; hence we used the linearizing formula Nc=10^((−0.298Ct+10.493)).
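
The channel-combining step can be sketched directly from the linearizing formula. The coefficients are the ones reported above; the function names and the example C_(t) values are hypothetical.

```python
import math

A, B = 0.298, 10.493  # empirical coefficients reported in the text

def ct_to_copies(ct, a=A, b=B):
    """Linearize a cycle-threshold value: Nc = 10^(-a*Ct + b)."""
    return 10.0 ** (-a * ct + b)

def copies_to_ct(nc, a=A, b=B):
    """Inverse of ct_to_copies: Ct = (b - log10(Nc)) / a."""
    return (b - math.log10(nc)) / a

def composite_ct(ct_fam, ct_vic, a=A, b=B):
    """Convert both channels to copy numbers, sum them, and convert back to Ct."""
    total_copies = ct_to_copies(ct_fam, a, b) + ct_to_copies(ct_vic, a, b)
    return copies_to_ct(total_copies, a, b)

# Hypothetical locus where both alleles give the same Ct on their channels.
combined = composite_ct(30.0, 30.0)
```

Because the mapping is logarithmic, doubling the copy number lowers the composite C_(t) by log10(2)/a, about 1.01 cycles for a=0.298, so the composite value is always lower than either channel alone.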

FIG. 14, top panel, shows the means and standard deviations of the differences between the composite C_(t) values of male and female samples at each of the 19 loci measured. The mean difference between the male and female samples is 1.19 and the SD of the differences is 0.62. Note that this locus-specific SD will not be affected by the different efficiencies of the assay at each locus. Three runs were conducted for every locus in order to estimate σ_(xi) for the assay at that locus so that a matched filter could be created and applied. Note that a lower limit of standard deviation at each locus was set at 0.6 in order to avoid statistical anomalies resulting from the small number of runs at each locus. Only those loci for which there were no allele dropouts over at least two experiment runs and over both male and female samples were used in the plots and calculations. Applying the approach above to this data, it was found that MSNR=9.7. Hence N=6 loci are required in order to have 99.99% confidence of the test, assuming those 6 loci generate the same MSNR (Mean Signal to Noise Ratio) as the 19 loci used in this test. The combined measure m and its expected standard deviation after applying the matched filter for male and female samples is shown in FIG. 14, bottom.

The same approach was applied to samples diluted 10 times from the above-mentioned DNA. Each well now contained roughly 6 pg (equivalent to the amount in a single cell genome) of total DNA from female and male blood samples. As in the previous case, three replicates were tested for each locus in order to estimate the mean and standard deviation of the difference in C_(t) levels between male and female samples, and a lower limit of 0.6 was set on the standard deviation at each locus in order to avoid statistical anomalies resulting from the small number of runs at each locus. Only those loci for which there were no allele dropouts over at least two experiment runs and over both male and female samples were used in the plots and calculations. This resulted in 13 loci that were used in creating the matched filter. FIG. 15, with the same layout as FIG. 14, shows how the differences between male and female measurements are combined into a single measure. The estimated number, N, of SNPs that need to be measured in order to assure 99.99% confidence in the test is 11. This assumes that these 11 loci have the same MSNR (Mean Signal to Noise Ratio) as the 13 loci tested in this experiment.

Note that this result of 11 loci is significantly lower than we expect to employ in practice. The primary reason is that this experiment performed an allele-specific amplification in each Taqman well, in which 6 pg of DNA is placed. The expected standard deviation for each locus is larger when one initially performs a whole-genome amplification in order to generate a sufficient quantity of genetic material that can be placed in each well. In a later section, we describe the experiment that addresses this issue, using Multiple Displacement Amplification (MDA) whole genome amplification of single cells.

A similar approach to that described above was also applied to data measured with the SYBR qPCR Assay using 20 SNPs on the X chromosome of female and male blood samples. Again, the measurement for each locus is the cycle number, C_(t), that it takes the dye released in the well corresponding to this locus to exceed a threshold. Note that we do not need to combine measurements from different dyes in this case, since only one dye is used to represent the total amount of genetic material at a locus, independent of allele value. Samples 0 and 1 consisted of roughly 60 pg (10 cells) of total DNA per well of female and male samples, respectively. FIG. 16, top, shows the means and standard deviations of the differences of C_(t) for male and female samples at each of the 20 loci. The mean difference between the male and female samples is 1.03 and the SD of the difference is 0.78. Three runs were conducted for every locus, and only those loci for which there were no allele dropouts over at least two experiment runs and over both male and female samples were used. Applying the approach above to this data, it was found that N=14 loci are required in order to have 99.99% confidence in the test, assuming those 14 loci have the same MSNR as the 20 used in this experiment.

Again, in order to estimate the number of SNPs needed for single cell measurements, this technique was applied to samples that consist of 6 pg of total DNA of female and male origin; see FIG. 17. The estimated number, N, of SNPs that need to be measured in order to assure 99.99% confidence in differentiating one chromosome copy from two is 27, assuming those 27 loci have the same MSNR as the 20 used in this experiment. As described above, this result of 27 loci is lower than we expect to employ in practice, since this experiment uses locus-specific amplification in each qPCR well. We next describe an experiment that addresses this issue. The experiments discussed hitherto were designed to specifically address locus-specific or allele-specific amplification of small DNA quantities, without complicating the experiment by introducing issues of cell lysis and whole genome amplification. The quantitative measurements were done on small quantities of DNA diluted from a large sample, without using whole genome pre-amplification of single cells. Despite the goal of simplifying the experiment, separate dilution and pipetting of the two samples affected the amount of DNA used to compare male with female samples at each locus. To ameliorate this effect, the diluted concentrations were measured using spectrometer comparisons to DNA with known concentrations and were calibrated appropriately.

We now describe an experiment that employs the protocol that will be used for real aneuploidy screening, including cell lysis and whole genome amplification. The level of amplification is in excess of 10,000 in order to generate sufficient genetic material to populate roughly 1,000 Taqman wells with 60 pg of DNA from each cell. In order to estimate the standard deviation of single genotyping assays with whole-genome pre-amplification, multiple experiments were conducted in which a single female HeLa cell (XX) was pre-amplified using Multiple Displacement Amplification (MDA) and the amounts of DNA at 20 loci on its X chromosome were measured using quantitative PCR assays. This experiment was repeated for 16 single HeLa cells in 16 separate MDA pre-amplifications. The results of the experiment are designed to be conservative, since the standard deviation between loci from separate amplifications will be greater than the standard deviation expected between loci used in the same reaction. In actual implementation, we will compare loci of chromosomes that were involved in the same MDA amplification. Furthermore, the experiment is conservative since we assume that the C_(t) of a cell with one X chromosome will be one cycle more than that of the HeLa cell (XX), i.e. we increased the C_(t) measurements of a double-X cell by 1 to simulate a single-X cell. This is a conservative estimate because the difference between C_(t) values is typically greater than 1 due to inefficiencies of the MDA and PCR assays: with a perfectly efficient PCR reaction, the amount of DNA is doubled in each cycle, but the amplification is typically less than a factor of two in each PCR cycle due to imperfect hybridization and other effects. This experiment was designed to establish an upper limit on the number of loci that we will need to measure to screen for aneuploidies that involve M2 copy errors, where quantitative data is necessary. Applying the matched filter technique to this data, as shown in FIG. 18A, the minimum number of SNPs needed to achieve 99.99% confidence is estimated as N=131.

To estimate realistically how many SNPs are required for this approach to differentiate one copy of a chromosome from two in real aneuploidy screening, it is highly desirable to test the system on samples in which one sample contains twice the amount of genetic material as the other. This precise control is not easily achieved by sample handling in separate wells because the dilution, pipetting and/or amplification efficiency vary from well to well. Here an experiment was designed to overcome these issues by using an internal control, namely by comparing the amount of genetic material on Chromosome 7 and Chromosome X of a male sample. Multiple experiments were conducted in which a single male MRC-5 cell (X) was pre-amplified using Multiple Displacement Amplification (MDA) and the amounts of DNA at 11 loci on its Chromosome 7 and 13 loci on its Chromosome X were measured using ABI Taqman assays. The difference in the numbers of loci on Chromosome 7 and X was chosen because a larger standard deviation of the measured amount of genetic material is expected for Chromosome X. This experiment was repeated for 15 single MRC-5 cells in 15 separate MDA pre-amplifications. After combining readouts from both FAM and VIC channels of all loci, we used the averaged composite C_(t) values on Chromosome 7 as the reference, which corresponds to the “normal” sample referred to in aneuploidy screening. The reference was subtracted from the composite C_(t) values on Chromosome X, and if a significant difference supported by many loci is detected, it corresponds to the “aneuploidy” condition. To create a matched filter, the standard deviation of these differences at each locus was measured using the results of the 15 independent single cell experiments. The mean and standard deviation at each locus are shown in FIG. 18B. The resulting MSNR is 0.631, so the number of SNPs to be measured to achieve 99.99% confidence is N=88. Note that this is an upper limit because we are still using single cells pre-amplified separately.
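The relation between MSNR and the required number of loci is defined in an earlier section. As a hedged illustration only, the sketch below assumes that the combined signal-to-noise ratio grows linearly with the number of loci (N·MSNR) and that the decision threshold is placed halfway between the two hypotheses, so that sqrt(N·MSNR)/2 must reach the normal quantile for the desired confidence. This is an assumed reading, not a statement of the method itself; the function name is hypothetical. Under this assumption the sketch reproduces the pairs quoted in this section (MSNR=9.7 with N=6, and MSNR=0.631 with N=88).

```python
import math
from statistics import NormalDist

def loci_needed(msnr, confidence=0.9999):
    """Smallest N with sqrt(N * msnr) / 2 >= z, where z is the standard
    normal quantile for the requested one-sided confidence.
    Assumes equal noise under both hypotheses and a midpoint threshold."""
    z = NormalDist().inv_cdf(confidence)
    return math.ceil((2 * z) ** 2 / msnr)
```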

Method 1b: Measuring Aneuploidy or Sex by Quantitative Techniques that Do Not Make Allele Calls When the Mean and Std. Deviation is Not Known or is Uniform

When the characteristics of each locus are not known well, the simplifying assumption can be made that the assays at all loci behave similarly, namely that E(m_(xi)) and σ_(xi) are constant across all loci i, so that it is possible to refer instead only to E(m_(x)) and σ_(x). In this case, the matched filtering approach m=h^(T)d_(x) reduces to finding the mean of the distribution of d_(x). This approach will be referred to as comparison of means, and it will be used to estimate the number of loci required for different kinds of detection using real data.

As above, consider the scenario when there are two chromosomes present in the sample (hypothesis h₀) or one chromosome present (h₁). For h₀, the distribution is N(μ₀,σ₀²) and for h₁ the distribution is N(μ₁,σ₁²). Measure each of the distributions using N₀ and N₁ samples respectively, with measured sample means and SDs m₁, m₀, s₁, and s₀. The means can be modeled as random variables M₀, M₁ that are normally distributed as M₀˜N(μ₀, σ₀²/N₀) and M₁˜N(μ₁, σ₁²/N₁). Assume N₁ and N₀ are large enough (>30) so that one can assume that M₁˜N(m₁, s₁²/N₁) and M₀˜N(m₀, s₀²/N₀). In order to test whether the distributions are different, the difference of the means test may be used, where d=m₁−m₀. The variance of the random variable D is σ_(d)²=σ₁²/N₁+σ₀²/N₀, which may be approximated as σ_(d)²=s₁²/N₁+s₀²/N₀. Given h₀, E(d)=0; given h₁, E(d)=μ₁−μ₀. Different techniques for making the call between h₁ and h₀ will now be discussed.

Data measured with a different run of the Taqman Assay using 48 SNPs on the X chromosome was used to calibrate performance. Sample 1 consisted of roughly 0.3 ng of DNA per well of mixed male origin containing one X chromosome; sample 0 consisted of roughly 0.3 ng of DNA per well of mixed female origin containing two X chromosomes. N₁=42 and N₀=45. FIGS. 5 and 6 show the histograms for samples 1 and 0. The distributions for these samples are characterized by m₁=32.259, s₁=1.460, σ_(m1)=s₁/sqrt(N₁)=0.225; m₀=30.75, s₀=1.202, σ_(m0)=s₀/sqrt(N₀)=0.179. For these samples d=1.509 and σ_(d)=0.2879.
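The difference-of-means statistics quoted for this data set can be checked with a short calculation. The sketch below is illustrative; the function name is hypothetical and the inputs are the sample values given above.

```python
import math

def difference_of_means(m1, s1, n1, m0, s0, n0):
    """Return d = m1 - m0 and its standard deviation
    sigma_d = sqrt(s1^2/N1 + s0^2/N0)."""
    d = m1 - m0
    sigma_d = math.sqrt(s1 ** 2 / n1 + s0 ** 2 / n0)
    return d, sigma_d

# Sample 1 (male, one X) and sample 0 (female, two X) from the text.
d, sigma_d = difference_of_means(32.259, 1.460, 42, 30.75, 1.202, 45)
```

With these inputs d is 1.509 and σ_(d) agrees with the quoted 0.2879 to within rounding.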

Since this data is derived from mixed male and female samples, much of the standard deviation is due to the different allele frequencies at each SNP in the mixed samples. SD is estimated by considering the variations in C_(t) for one SNP at a time, over multiple runs. This data is shown in FIG. 7. The histogram is symmetric around 0 since C_(t) for each SNP is measured in two runs or experiments and the mean value of C_(t) for each SNP is subtracted out. The average std. dev. across 20 SNPs in the mixed male sample using two runs is s=0.597. This SD will be conservatively used for both male and female samples, since SD for the female sample will be smaller than for the male sample. In addition, note that the measurement from only one dye is being used, since the mixed samples are assumed to be heterozygous for all SNPs. The use of both dyes requires the measurements of each allele at a locus to be combined, which is more complicated (see section on linearizing measurements). Combining measurements on both dyes would double signal amplitude and increase noise amplitude by roughly sqrt(2), resulting in an SNR improvement of roughly sqrt(2) or 3 dB.

Detection Assuming No Mosaicism and No Reference Sample

Assume that m₀ is known perfectly from many experiments, and every experiment runs only one sample to compute m₁ to compare with m₀. N₁ is the number of assays; assume that each assay is a different SNP locus. A threshold t can be set halfway between m₀ and m₁ to make the likelihood of false positives equal to the likelihood of false negatives, and a sample is labeled abnormal if it is above the threshold. Assume s₁=s₀=s=0.597 and use the 5-sigma approach so that the probability of false negatives or positives is 1−normcdf(5,0,1)=2.87e-7. The goal is for 5s₁/sqrt(N₁)<(m₁−m₀)/2, hence N₁=100s₁²/(m₁−m₀)²=16. An approach may also be used in which the probability of a false positive is allowed to be higher than the probability of a false negative, the latter being the harmful scenario. If a positive is measured, the experiment may be rerun. Consequently, it is possible to require that the probability of a false negative be equal to the square of the probability of a false positive. Consider FIG. 1, let t be the threshold, and assume σ₀=σ₁=s. Thus (1−normcdf((t−m₀)/s,0,1))²=1−normcdf((m₁−t)/s,0,1). Solving this, it can be shown that t=m₀+0.32(m₁−m₀). Hence the goal is for 5s/sqrt(N₁)<m₁−m₀−0.32(m₁−m₀)=(m₁−m₀)/1.47, hence N₁=(5²)(1.47²)s²/(m₁−m₀)²=9.
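The two locus counts in this paragraph follow from the stated inequalities. The sketch below is a minimal check, assuming s=0.597 and m₁−m₀=1.509 from the preceding data, and taking the threshold fraction 0.32 as given in the text rather than re-deriving it; the function names are hypothetical.

```python
import math

def loci_symmetric(s, m1, m0, k=5.0):
    """Smallest N1 with k*s/sqrt(N1) < (m1 - m0)/2,
    i.e. the threshold sits halfway between m0 and m1."""
    return math.ceil((2 * k * s / (m1 - m0)) ** 2)

def loci_asymmetric(s, m1, m0, k=5.0, frac=0.32):
    """Smallest N1 with k*s/sqrt(N1) < (1 - frac)*(m1 - m0),
    i.e. the threshold sits at m0 + frac*(m1 - m0)."""
    return math.ceil((k * s / ((1 - frac) * (m1 - m0))) ** 2)

n_sym = loci_symmetric(0.597, 32.259, 30.75)    # 16 loci
n_asym = loci_asymmetric(0.597, 32.259, 30.75)  # 9 loci
```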

Detection with Mosaicism without Running a Reference Sample

Assume the same situation as above, except that the goal is to detect mosaicism with a probability of 97.7% (i.e. the 2-sigma approach). This is better than the standard approach to amniocentesis, which extracts roughly 20 cells and photographs them. If one assumes that 1 in 20 cells is aneuploid and this is detected with 100% reliability, the probability of having at least one of the group being aneuploid using the standard approach is 1−0.95²⁰=64%. If 5% of the cells are aneuploid (call this sample 3) then m₃=0.95m₀+0.05m₁ and var(m₃)=(0.95s₀²+0.05s₁²)/N₁. Thus, 2std(m₃)<(m₃−m₀)/2 => sqrt(0.95s₀²+0.05s₁²)/sqrt(N₁)<0.05(m₁−m₀)/4 => N₁=16(0.95s₀²+0.05s₁²)/(0.05²(m₁−m₀)²)=1001. Note that using the goal of 1-sigma statistics, which is still better than can be achieved using the conventional approach (i.e. detection with 84.1% probability), it can be shown in a similar manner that N₁=250.

Detection with No Mosaicism and Using a Reference Sample

Although this approach may not be necessary, assume that every experiment runs two samples in order to compare m₁ with the truth sample m₀. Assume that N=N₁=N₀. Compute d=m₁−m₀ and, assuming σ₁=σ₀, set a threshold t=(m₀+m₁)/2 so that the probability of false positives and false negatives is equal. To make the probability of false negatives 2.87e-7, it must be the case that (m₁−m₀)/2>5sqrt(s₁²/N+s₀²/N) => N=100(s₁²+s₀²)/(m₁−m₀)²=32.

Detection with Mosaicism and Running a Reference Sample

As above, assume the probability of false negatives is 2.3% (i.e. the 2-sigma approach). If 5% of the cells are aneuploid (call this sample 3) then m₃=0.95m₀+0.05m₁ and var(m₃)=(0.95s₀²+0.05s₁²)/N₁. Then d=m₃−m₀ and σ_(d)²=(1.95s₀²+0.05s₁²)/N. It must be that 2σ_(d)<(m₃−m₀)/2 => sqrt(1.95s₀²+0.05s₁²)/sqrt(N)<0.05(m₁−m₀)/4 => N=16(1.95s₀²+0.05s₁²)/(0.05²(m₁−m₀)²)=2002. Again using the 1-sigma approach, it can be shown in a similar manner that N=500.
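The mosaicism estimates, with and without a reference sample, share one formula and can be sketched together. The function below is illustrative and its name is hypothetical; it assumes s₀=s₁=0.597 and m₁−m₀=1.509 as in the earlier data and returns the unrounded value, which lands within a locus or two of the rounded figures quoted in the text (roughly 1001 and 250 without a reference, roughly 2002 and 500 with one).

```python
def loci_for_mosaicism(frac, s0, s1, m1, m0, k=2.0, with_reference=False):
    """Unrounded N satisfying k*std < (m3 - m0)/2 for a sample in which a
    fraction `frac` of cells is aneuploid, so that m3 - m0 = frac*(m1 - m0).
    Running a reference sample adds its variance s0^2/N to the difference."""
    variance_numerator = (1 - frac) * s0 ** 2 + frac * s1 ** 2
    if with_reference:
        variance_numerator += s0 ** 2
    return (2 * k) ** 2 * variance_numerator / (frac * (m1 - m0)) ** 2

s, m1, m0 = 0.597, 32.259, 30.75
n_2sigma = loci_for_mosaicism(0.05, s, s, m1, m0, k=2.0)
n_1sigma = loci_for_mosaicism(0.05, s, s, m1, m0, k=1.0)
n_2sigma_ref = loci_for_mosaicism(0.05, s, s, m1, m0, k=2.0, with_reference=True)
n_1sigma_ref = loci_for_mosaicism(0.05, s, s, m1, m0, k=1.0, with_reference=True)
```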

Consider the case in which the goal is only to detect 5% mosaicism with a probability of 64%, as is the current state of the art. Then the probability of a false negative would be 36%. In other words, it would be necessary to find x such that 1−normcdf(x,0,1)=36%, which gives x≈0.36. Thus N=4(0.36²)(1.95s₀²+0.05s₁²)/(0.05²(m₁−m₀)²)=65 for the 2-sigma approach, or N=33 for the 1-sigma approach. Note that this would result in a very high level of false positives, which needs to be addressed, since such a level of false positives is not currently a viable alternative.

Also note that if N is limited to 384 (i.e. one 384 well Taqman plate per chromosome), and the goal is to detect mosaicism with a probability of 97.72%, then it will be possible to detect mosaicism of 8.1% using the 2-sigma approach. If the goal is instead to detect mosaicism with a probability of 84.1% (i.e. with a 15.9% false negative rate), then it will be possible to detect mosaicism of 5.8% using the 1-sigma approach. To detect mosaicism of 19% with a confidence of 97.72% would require roughly 70 loci. Thus one could screen for 5 chromosomes on a single plate.

The summary of each of these different scenarios is provided in Table 1. Also included in this table are the results generated from qPCR and the SYBR assays. The methods described above were used and the simplifying assumption was made that the performance of the qPCR assay for each locus is the same. FIGS. 8 and 9 show the histograms for samples 1 and 0, as described above. N₀=N₁=47. The distributions of the measurements for these samples are characterized by m₁=27.65, s₁=1.40, σ_(m1)=s₁/sqrt(N₁)=0.204; m₀=26.64, s₀=1.146, σ_(m0)=s₀/sqrt(N₀)=0.167. For these samples d=1.01 and σ_(d)=0.2636. FIG. 10 shows the difference between C_(t) for the male and female samples for each locus, with a standard deviation of the difference over all loci of 0.75. The SD was approximated for each measurement of each locus on the male or female sample as 0.75/sqrt(2)=0.53.

Method 2: Qualitative Techniques that Use Allele Calls

In this section, no assumption is made that the assay is quantitative. Instead, the assumption is that the allele calls are qualitative, and that there is no meaningful quantitative data coming from the assays. This approach is suitable for any assay that makes an allele call. FIG. 11 describes how different haploid gametes form during meiosis, and will be used to describe the different kinds of aneuploidy that are relevant for this section. The best algorithm depends on the type of aneuploidy that is being detected.

Consider a situation where aneuploidy is caused by a third segment that has no section that is a copy of either of the other two segments. From FIG. 11, the situation would arise, for example, if p₁ and p₄, or p₂ and p₃, both arose in the child cell in addition to one segment from the other parent. This is very common, given the mechanism which causes aneuploidy. One approach is to start off with a hypothesis h₀ that there are two segments in the cell and what these two segments are. Assume, for the purpose of illustration, that h₀ is for p₃ and m₄ from FIG. 11. In a preferred embodiment this hypothesis comes from algorithms described elsewhere in this document. Hypothesis h₁ is that there is an additional segment that has no sections that are a copy of the other segments. This would arise, for example, if p₂ or m₁ was also present. It is possible to identify all loci that are homozygous in p₃ and m₄. Aneuploidy can be detected by searching for heterozygous genotype calls at loci that are expected to be homozygous.

Assume every locus has two possible alleles, x and y. Let the probability of alleles x and y in general be p_(x) and p_(y) respectively, with p_(x)+p_(y)=1. If h₁ is true, then for each locus i for which p₃ and m₄ are homozygous, the probability of a non-homozygous call is p_(y) or p_(x), depending on whether the locus is homozygous in x or y respectively. Note: based on knowledge of the parent data, i.e. p₁, p₂, p₄ and m₁, m₂, m₃, it is possible to further refine the probabilities for having non-homozygous alleles x or y at each locus. This will enable more reliable measurements for each hypothesis with the same number of SNPs, but complicates notation, so this extension will not be explicitly dealt with. It should be clear to someone skilled in the art how to use this information to increase the reliability of the hypothesis.

The probability of allele dropouts is p_(d). The probability of finding a heterozygous genotype at locus i is p_(0i) given hypothesis h₀ and p_(1i) given hypothesis h₁.

Given h₀: p_(0i)=0

Given h₁: p_(1i)=p_(x)(1−p_(d)) or p_(1i)=p_(y)(1−p_(d)) depending on whether the locus is homozygous for x or y.

Create a measurement m=(1/N_(h))Σ_(i=1 . . . Nh) I_(i) where I_(i) is an indicator variable that is 1 if a heterozygous call is made and 0 otherwise. N_(h) is the number of homozygous loci. One can simplify the explanation by assuming that p_(x)=p_(y) and that p_(0i), p_(1i) for all loci are the same two values p₀ and p₁. Given h₀, E(m)=p₀=0 and σ²_(m|h0)=p₀(1−p₀)/N_(h). Given h₁, E(m)=p₁ and σ²_(m|h1)=p₁(1−p₁)/N_(h). Using 5-sigma statistics, and making the probability of false positives equal to the probability of false negatives, it can be shown that (p₁−p₀)/2>5σ_(m|h1), hence N_(h)=100(p₀(1−p₀)+p₁(1−p₁))/(p₁−p₀)². For 2-sigma confidence instead of 5-sigma confidence, it can be shown that N_(h)=(4)(2²)(p₀(1−p₀)+p₁(1−p₁))/(p₁−p₀)².

It is necessary to sample enough loci N that there will be sufficient available homozygous loci N_(h-avail) such that the confidence is at least 97.7% (2-sigma). Characterize N_(h-avail)=Σ_(i=1 . . . N) J_(i) where J_(i) is an indicator variable of value 1 if the locus is homozygous and 0 otherwise. The probability of the locus being homozygous is p_(x)²+p_(y)². Consequently, E(N_(h-avail))=N(p_(x)²+p_(y)²) and σ_(Nh-avail)²=N(p_(x)²+p_(y)²)(1−p_(x)²−p_(y)²). To guarantee N is large enough with 97.7% confidence, it must be that E(N_(h-avail))−2σ_(Nh-avail)=N_(h) where N_(h) is found from above.

For example, if one assumes p_(d)=0.3, p_(x)=p_(y)=0.5, one can find N_(h)=186 and N=391 for 5-sigma confidence. Similarly, it is possible to show that N_(h)=30 and N=68 for 2-sigma confidence, i.e. 97.7% confidence in false negatives and false positives.
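The number of homozygous loci N_(h) in this example can be reproduced directly from the formula for N_(h) given above. The sketch below covers only that step (not the subsequent calculation of the total number of sampled loci N); the function and parameter names are hypothetical.

```python
import math

def homozygous_loci_needed(p_other_allele, p_dropout, k=5.0, p0=0.0):
    """N_h = (2k)^2 * (p0(1-p0) + p1(1-p1)) / (p1 - p0)^2, where
    p1 = p_other_allele * (1 - p_dropout) is the chance of a heterozygous
    call at an expected-homozygous locus when a third segment is present,
    and k is the number of sigmas of confidence."""
    p1 = p_other_allele * (1 - p_dropout)
    n_h = (2 * k) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / (p1 - p0) ** 2
    return math.ceil(n_h)

n_h_5sigma = homozygous_loci_needed(0.5, 0.3, k=5.0)  # 186
n_h_2sigma = homozygous_loci_needed(0.5, 0.3, k=2.0)  # 30
```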

Note that a similar approach can be applied to looking for deletions of a segment when h₀ is the hypothesis that two known chromosome segments are present, and h₁ is the hypothesis that one of the chromosome segments is missing. For example, it is possible to look for all of those loci that should be heterozygous but are homozygous, factoring in the effects of allele dropouts as has been done above.

Also note that even though the assay is qualitative, allele dropout rates may be used to provide a type of quantitative measure of the number of DNA segments present.

Method 3: Making Use of Known Alleles of Reference Sequences, and Quantitative Allele Measurements

Here, it is assumed that the alleles of the normal or expected set of segments are known. In order to check for three chromosomes, the first step is to clean the data, assuming two of each chromosome. In a preferred embodiment of the invention, the data cleaning in the first step is done using methods described elsewhere in this document. Then the signal associated with the expected two segments is subtracted from the measured data. One can then look for an additional segment in the remaining signal. A matched filtering approach is used, and the signal characterizing the additional segment is based on each of the segments that are believed to be present, as well as their complementary chromosomes. For example, considering FIG. 11, if the results of PS indicate that segments p₂ and m₁ are present, the technique described here may be used to check for the presence of p₂, p₃, m₁ and m₄ on the additional chromosome. If there is an additional segment present, it is guaranteed to have more than 50% of the alleles in common with at least one of these test signals. Note that another approach, not described in detail here, is to use an algorithm described elsewhere in the document to clean the data, assuming an abnormal number of chromosomes, namely 1, 3, 4 and 5 chromosomes, and then to apply the method discussed here. The details of this approach should be clear to someone skilled in the art after having read this document.

Hypothesis h₀ is that there are two chromosomes with allele vectors a₁, a₂. Hypothesis h₁ is that there is a third chromosome with allele vector a₃. Using a method described in this document to clean the genetic data, or another technique, it is possible to determine the alleles of the two segments expected by h₀: a₁=[a₁₁ . . . a_(1N)] and a₂=[a₂₁ . . . a_(2N)] where each element a_(ji) is either x or y. The expected signal is created for hypothesis h₀: s_(0x)=[f_(x)(a₁₁, a₂₁) . . . f_(x)(a_(1N), a_(2N))], s_(0y)=[f_(y)(a₁₁, a₂₁) . . . f_(y)(a_(1N), a_(2N))] where f_(x), f_(y) describe the mapping from the set of alleles to the measurements of each allele. Given h₀, the data may be described as d_(xi)=s_(0xi)+n_(xi), n_(xi)˜N(0,σ_(xi)²); d_(yi)=s_(0yi)+n_(yi), n_(yi)˜N(0,σ_(yi)²). Create a measurement by differencing the data and the reference signal: m_(xi)=d_(xi)−s_(0xi); m_(yi)=d_(yi)−s_(0yi). The full measurement vector is m=[m_(x)^(T) m_(y)^(T)]^(T).

Now, create the signal for the segment of interest (the segment whose presence is suspected, and will be sought in the residual) based on the assumed alleles of this segment: a₃=[a₃₁ . . . a_(3N)]. Describe the signal for the residual as s_(r)=[s_(rx)^(T) s_(ry)^(T)]^(T) where s_(rx)=[f_(rx)(a₃₁) . . . f_(rx)(a_(3N))], s_(ry)=[f_(ry)(a₃₁) . . . f_(ry)(a_(3N))], with f_(rx)(a_(3i))=δ_(xi) if a_(3i)=x and 0 otherwise, and f_(ry)(a_(3i))=δ_(yi) if a_(3i)=y and 0 otherwise. This analysis assumes that the measurements have been linearized (see section below) so that the presence of one copy of allele x at locus i generates data δ_(xi)+n_(xi) and the presence of κ_(x) copies of the allele x at locus i generates data κ_(x)δ_(xi)+n_(xi). Note however that this assumption is not necessary for the general approach described here. Given h₁, if allele a_(3i)=x then m_(xi)=δ_(xi)+n_(xi), m_(yi)=n_(yi), and if a_(3i)=y then m_(xi)=n_(xi), m_(yi)=δ_(yi)+n_(yi). Consequently, a matched filter h=(1/N)R⁻¹s_(r) can be created, where R=diag([σ_(x1)² . . . σ_(xN)² σ_(y1)² . . . σ_(yN)²]). The measurement is m=h^(T)d.

h₀: m=(1/N)Σ_(i=1 . . . N) [s_(rxi)n_(xi)/σ_(xi)² + s_(ryi)n_(yi)/σ_(yi)²]

h₁: m=(1/N)Σ_(i=1 . . . N) [s_(rxi)(δ_(xi)+n_(xi))/σ_(xi)² + s_(ryi)(δ_(yi)+n_(yi))/σ_(yi)²]

In order to estimate the number of SNPs required, make the simplifying assumptions that all assays for all alleles and all loci have similar characteristics, namely that δ_(xi)=δ_(yi)=δ and σ_(xi)=σ_(yi)=σ for i=1 . . . N. Then, the mean and standard deviation may be found as follows:

h₀: E(m)=m₀=0; σ_(m|h0)²=(1/(N²σ⁴))(N/2)(σ²δ²+σ²δ²)=δ²/(Nσ²)

h₁: E(m)=m₁=(1/N)(N/(2σ²))(δ²+δ²)=δ²/σ²; σ_(m|h1)²=(1/(N²σ⁴))(N)(σ²δ²)=δ²/(Nσ²)

Now compute a signal-to-noise ratio (SNR) for this test of h₁ versus h₀. The signal is m₁−m₀=δ²/σ², and the noise variance of this measurement is σ_(m|h0)²+σ_(m|h1)²=2δ²/(Nσ²). Consequently, the SNR for this test is (δ⁴/σ⁴)/(2δ²/(Nσ²))=Nδ²/(2σ²).

Compare this SNR to the scenario where the genetic information is simply summed at each locus without performing a matched filtering based on the allele calls. Assume that h=(1/N)ī where ī is the vector of N ones, and make the simplifying assumptions as above that δ_(xi)=δ_(yi)=δ and σ_(xi)=σ_(yi)=σ for i=1 . . . N. For this scenario, it is straightforward to show that if m=h^(T)d:

h₀: E(m)=m₀=0; σ_(m|h0)²=Nσ²/N²+Nσ²/N²=2σ²/N

h₁: E(m)=m₁=(1/N)(Nδ/2+Nδ/2)=δ; σ_(m|h1)²=(1/N²)(Nσ²+Nσ²)=2σ²/N

Consequently, the SNR for this test is Nδ²/(4σ²). In other words, by using a matched filter that only sums the allele measurements that are expected for segment a₃, the number of SNPs required is reduced by a factor of 2. This ignores the SNR gain achieved by using matched filtering to account for the different efficiencies of the assays at each locus.
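The factor-of-2 comparison can be illustrated numerically. The sketch below is a simplified illustration under the same uniform assumptions (a single δ and σ for every assay); it computes the signal as h^(T)s_(r) and the noise variance as the sum of the variances under h₀ and h₁, and applies the plain averaging filter as weight 1/N on all 2N channel measurements. The function names and the example allele vector are hypothetical.

```python
import math

def filter_snr(h, s_r, sigma):
    """SNR = (m1 - m0)^2 / (var_h0 + var_h1) for a linear filter h applied
    to measurements with independent noise of variance sigma^2."""
    signal = sum(hk * sk for hk, sk in zip(h, s_r))
    noise_var = 2 * sigma ** 2 * sum(hk ** 2 for hk in h)
    return signal ** 2 / noise_var

def compare_filters(a3, delta, sigma):
    """Return (matched-filter SNR, plain-sum SNR) for allele vector a3,
    given as a sequence of 'x' / 'y' calls, one per locus."""
    n = len(a3)
    # Residual signal: delta in the x slot or the y slot of each locus.
    s_r = ([delta if a == "x" else 0.0 for a in a3]
           + [delta if a == "y" else 0.0 for a in a3])
    matched = [sk / (n * sigma ** 2) for sk in s_r]  # h = (1/N) R^-1 s_r
    plain = [1.0 / n] * (2 * n)                      # equal weight everywhere
    return filter_snr(matched, s_r, sigma), filter_snr(plain, s_r, sigma)

snr_matched, snr_plain = compare_filters("xy" * 50, delta=1.0, sigma=2.0)
```

For this example with N=100, δ=1 and σ=2, the two values equal Nδ²/(2σ²) and Nδ²/(4σ²) respectively.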

Note that if we do not correctly characterize the reference signals s_(0xi) and s_(0yi) then the SD of the noise or disturbance on the resulting measurement signals m_(xi) and m_(yi) will be increased. This will be insignificant if δ<<σ, but otherwise it will increase the probability of false detections. Consequently, this technique is well suited to test the hypothesis where three segments are present and two segments are assumed to be exact copies of each other. In this case, s_(0xi) and s_(0yi) will be reliably known using techniques of data cleaning based on qualitative allele calls described elsewhere. In one embodiment, method 3 is used in combination with method 2, which uses qualitative genotyping and, aside from the quantitative measurements from allele dropouts, is not able to detect the presence of a second exact copy of a segment.

We now describe another quantitative technique that makes use of allele calls. The method involves comparing the relative amount of signal at each of the four registers for a given allele. One can imagine that in the idealized case involving a single, normal cell, where homogenous amplification occurs (or the relative amounts of amplification are normalized), four possible situations can occur: (i) in the case of a heterozygous allele, the relative intensities of the four registers will be approximately 1:1:0:0, and the absolute intensity of the signal will correspond to one base pair; (ii) in the case of a homozygous allele, the relative intensities will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to two base pairs; (iii) in the case of an allele where ADO occurs for one of the alleles, the relative intensities will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to one base pair; and (iv) in the case of an allele where ADO occurs for both of the alleles, the relative intensities will be approximately 0:0:0:0, and the absolute intensity of the signal will correspond to no base pairs.

In the case of aneuploidies, however, different situations will be observed. For example, in the case of trisomy where there is no ADO, one of three situations will occur: (i) in the case of a triply heterozygous allele, the relative intensities of the four registers will be approximately 1:1:1:0, and the absolute intensity of the signal will correspond to one base pair; (ii) in the case where two of the alleles are homozygous, the relative intensities will be approximately 2:1:0:0, and the absolute intensity of the signal will correspond to two and one base pairs, respectively; (iii) in the case where all alleles are homozygous, the relative intensities will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to three base pairs. If allele dropout occurs in the case of an allele in a cell with trisomy, one of the situations expected for a normal cell will be observed. In the case of monosomy, the relative intensities of the four registers will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to one base pair. This situation corresponds to the case of a normal cell where ADO of one of the alleles has occurred; however, in the case of the normal cell, this will only be observed at a small percentage of the alleles. In the case of uniparental disomy (UPD), where two identical chromosomes are present, the relative intensities of the four registers will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to two base pairs. In the case of UPD where two different chromosomes from one parent are present, this method will indicate that the cell is normal, although further analysis of the data using other methods described in this patent will uncover this.

In all of these cases, whether the cells are normal or have aneuploidies or UPD, the data from one SNP will not be adequate to make a decision about the state of the cell. However, if the probabilities of each of the above hypotheses are calculated, and those probabilities are combined for a sufficient number of SNPs on a given chromosome, one hypothesis will predominate, and it will be possible to determine the state of the chromosome with high confidence.

Methods for Linearizing Quantitative Measurements

Many approaches may be taken to linearize measurements of the amount of genetic material at a specific locus so that data from different alleles can be easily summed or differenced. We first discuss a generic approach and then discuss an approach that is designed for a particular type of assay.

Assume data d_(xi) refers to a nonlinear measurement of the amount of genetic material of allele x at locus i. Create a training set of data using N measurements, where for each measurement, it is estimated or known that the amount of genetic material corresponding to data d_(xi) is β_(xi). The training set β_(xi), i=1 . . . N, is chosen to span all the different amounts of genetic material that might be encountered in practice. Standard regression techniques can be used to train a function that maps from the nonlinear measurement, d_(xi), to the expectation of the linear measurement, E(β_(xi)). For example, a linear regression can be used to train a polynomial function of order P, such that E(β_(xi))=[1 d_(xi) d_(xi)² . . . d_(xi)^(P)]c where c is the vector of coefficients c=[c₀ c₁ . . . c_(P)]^(T). To train this linearizing function, we create a vector of the amount of genetic material for N measurements β_(x)=[β_(x1) . . . β_(xN)]^(T) and a matrix of the measured data raised to powers 0 . . . P: D=[[1 d_(x1) d_(x1)² . . . d_(x1)^(P)]^(T) [1 d_(x2) d_(x2)² . . . d_(x2)^(P)]^(T) . . . [1 d_(xN) d_(xN)² . . . d_(xN)^(P)]^(T)]^(T). The coefficients can then be found using a least squares fit c=(D^(T)D)⁻¹D^(T)β_(x).
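The least squares fit c=(D^(T)D)⁻¹D^(T)β_(x) can be sketched without any external library by forming the normal equations and solving them by Gaussian elimination. This is a minimal illustration for small polynomial orders; the function names are hypothetical, and in practice a numerically more robust solver would normally be preferred.

```python
def fit_polynomial(d, beta, order):
    """Return coefficients c = [c0 ... cP] minimizing ||D c - beta||^2,
    where row i of D is [1, d_i, d_i^2, ..., d_i^P]."""
    n = order + 1
    rows = [[x ** p for p in range(n)] for x in d]
    # Normal equations: (D^T D) c = D^T beta
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, beta)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    # Back substitution
    c = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (b[i] - tail) / a[i][i]
    return c

def linearize(d_value, c):
    """Map a nonlinear measurement to the estimated amount of material."""
    return sum(ck * d_value ** p for p, ck in enumerate(c))
```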

Rather than depend on generic functions such as fitted polynomials, we may also create specialized functions for the characteristics of a particular assay. We consider, for example, the Taqman assay or a qPCR assay. The amount of dye for allele x and some locus i, as a function of time up to the point where it crosses some threshold, may be described as an exponential curve with a bias offset: $g_{xi}(t) = \alpha_{xi} + \beta_{xi}\exp(\gamma_{xi}t)$ where $\alpha_{xi}$ is the bias offset, $\gamma_{xi}$ is the exponential growth rate, and $\beta_{xi}$ corresponds to the amount of genetic material. To cast the measurements in terms of $\beta_{xi}$, compute the parameter $\alpha_{xi}$ by looking at the asymptotic limit of the curve $g_{xi}(-\infty)$; one may then find $\beta_{xi}$ and $\gamma_{xi}$ by taking the log of the curve to obtain $\log(g_{xi}(t) - \alpha_{xi}) = \log(\beta_{xi}) + \gamma_{xi}t$ and performing a standard linear regression. Once we have values for $\alpha_{xi}$ and $\gamma_{xi}$, another approach is to compute $\beta_{xi}$ from the time, $t_{x}$, at which the threshold $g_{x}$ is exceeded: $\beta_{xi} = (g_{x} - \alpha_{xi})\exp(-\gamma_{xi}t_{x})$. This will be a noisy measurement of the true amount of genetic data of a particular allele.
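The two ways of recovering the amount of genetic material from an exponential amplification curve can be sketched as below. The example is a simplified illustration: it assumes that the bias offset is already known, uses a noise-free synthetic curve, and applies ordinary least squares to the log-transformed signal. All names and parameter values are hypothetical.

```python
import math

def fit_growth(times, signal, alpha):
    """Regress log(g(t) - alpha) on t; returns (beta, gamma)."""
    ys = [math.log(g - alpha) for g in signal]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys))
    gamma = sxy / sxx                             # slope: exponential growth rate
    beta = math.exp(y_mean - gamma * t_mean)      # exp(intercept): amount of material
    return beta, gamma

def beta_from_threshold(threshold, alpha, gamma, t_cross):
    """beta = (g_x - alpha) * exp(-gamma * t_x) at the threshold-crossing time."""
    return (threshold - alpha) * math.exp(-gamma * t_cross)

# Synthetic curve (assumed values) following g(t) = alpha + beta * exp(gamma * t).
alpha_true, beta_true, gamma_true = 0.05, 2.0, 0.6
times = [float(t) for t in range(10)]
signal = [alpha_true + beta_true * math.exp(gamma_true * t) for t in times]

beta_hat, gamma_hat = fit_growth(times, signal, alpha_true)

threshold = 100.0
t_cross = math.log((threshold - alpha_true) / beta_true) / gamma_true
beta_thr = beta_from_threshold(threshold, alpha_true, gamma_hat, t_cross)
```

On real data the log transform amplifies noise near the baseline, so points close to the bias offset would typically be excluded or down-weighted before the regression.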

Whichever technique is used, we may model the linearized measurement as $\beta_{xi} = \kappa_{x}\delta_{xi} + n_{xi}$ where $\kappa_{x}$ is the number of copies of allele x, $\delta_{xi}$ is a constant for allele x and locus i, and $n_{xi} \sim N(0, \sigma_{x}^{2})$ where $\sigma_{x}^{2}$ can be measured empirically.

Method 4: Using a Probability Distribution Function for the Amplification of Genetic Data at Each Locus

The method described here is relevant for high throughput genotype data either generated by a PCR-based approach, for example using an Affymetrix Genotyping Array, or using the Molecular Inversion Probe (MIPs) technique, with the Affymetrix GenFlex Tag Array. In the former case, the genetic material is amplified by PCR before hybridization to probes on the genotyping array to detect the presence of particular alleles. In the latter case, padlock probes are hybridized to the genomic DNA and a gap-fill enzyme is added which can add one of the four nucleotides. If the added nucleotide (A, C, T, G) is complementary to the SNP under measurement, then it will hybridize to the DNA, and join the ends of the padlock probe by ligation. The closed padlock probes are then differentiated from linear probes by exonucleolysis. The probes that remain are then opened at a cleavage site by another enzyme, amplified by PCR, and detected by the GenFlex Tag Array.

Whichever technique is used, the quantity of material for a particular SNP will depend on the number of initial chromosomes in the cell on which that SNP is present. However, due to the random nature of the amplification and hybridization process, the quantity of genetic material from a particular SNP will not be directly proportional to the starting number of chromosomes. Let $q_{s,A}$, $q_{s,G}$, $q_{s,T}$, $q_{s,C}$ represent the amplified quantity of genetic material for a particular SNP s for each of the four nucleic acids (A, C, T, G) constituting the alleles. Note that these quantities are typically measured from the intensity of signals from particular hybridization probes on the array. This intensity measurement can be used instead of a measurement of quantity, or can be converted into a quantity estimate using standard techniques without changing the nature of the invention. Let $q_{s}$ be the sum of all the genetic material generated from all alleles of a particular SNP: $q_{s} = q_{s,A} + q_{s,G} + q_{s,T} + q_{s,C}$. Let N be the number of chromosomes in a cell containing the SNP s. N is typically 2, but may be 0, 1 or 3 or more. For either high-throughput genotyping method discussed above, and many other methods, the resulting quantity of genetic material can be represented as $q_{s} = (A + A_{\theta,s})N + \theta_{s}$ where A is the total amplification that is either estimated a-priori or easily measured empirically, $A_{\theta,s}$ is the error in the estimate of A for the SNP s, and $\theta_{s}$ is additive noise introduced in the amplification, hybridization and other processes for that SNP. The noise terms $A_{\theta,s}$ and $\theta_{s}$ are typically large enough that $q_{s}$ will not be a reliable measurement of N. However, the effects of these noise terms can be mitigated by measuring multiple SNPs on the chromosome. Let S be the number of SNPs that are measured on a particular chromosome, such as chromosome 21. We can then generate the average quantity of genetic material over all SNPs on a particular chromosome:

$\begin{matrix}{q = \frac{1}{S}\sum\limits_{s = 1}^{S} q_{s} = NA + \frac{1}{S}\sum\limits_{s = 1}^{S}\left( A_{\theta,s}N + \theta_{s} \right)} & (15)\end{matrix}$

Assuming that $A_{\theta,s}$ and $\theta_{s}$ are normally distributed random variables with 0 means and variances $\sigma_{A_{\theta,s}}^{2}$ and $\sigma_{\theta_{s}}^{2}$, we can model $q = NA + \varphi$ where $\varphi$ is a normally distributed random variable with 0 mean and variance

$\frac{1}{S}\left( N^{2}\sigma_{A_{\theta,s}}^{2} + \sigma_{\theta}^{2} \right).$

Consequently, if we measure a sufficient number of SNPs on the chromosome such that $S \gg \left( N^{2}\sigma_{A_{\theta,s}}^{2} + \sigma_{\theta}^{2} \right)$, we can accurately estimate $N = q/A$.
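The effect of averaging over many SNPs can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the disclosed procedure: the per-SNP amplification error and additive noise are drawn as independent zero-mean Gaussians, the amplification A is taken as known, and the function name and all numerical values are hypothetical.

```python
import random

def estimate_copy_number(n_true, amp, sd_amp, sd_add, num_snps, rng):
    """Simulate q_s = (A + A_err) * N + theta for each SNP and return q / A."""
    total = 0.0
    for _ in range(num_snps):
        a_err = rng.gauss(0.0, sd_amp)   # per-SNP error in the amplification A
        theta = rng.gauss(0.0, sd_add)   # per-SNP additive noise
        total += (amp + a_err) * n_true + theta
    q = total / num_snps                 # average quantity over all SNPs
    return q / amp                       # estimate of N

rng = random.Random(0)
# Assumed values: trisomy (N = 3), A = 1, noise standard deviations of 0.5.
few = estimate_copy_number(3, 1.0, 0.5, 0.5, 5, rng)
many = estimate_copy_number(3, 1.0, 0.5, 0.5, 5000, rng)
```

With these assumed noise levels the standard deviation of the estimate is sqrt((9 × 0.25 + 0.25)/S), which is about 0.71 for S = 5 and about 0.02 for S = 5000, so only the larger SNP set separates N = 3 from N = 2 reliably.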

The quantity of material for a particular SNP will depend on the number of initial segments in the cell on which that SNP is present. However, due to the random nature of the amplification and hybridization process, the quantity of genetic material from a particular SNP will not be directly proportional to the starting number of segments. Let $q_{s,A}$, $q_{s,G}$, $q_{s,T}$, $q_{s,C}$ represent the amplified quantity of genetic material for a particular SNP s for each of the four nucleic acids (A, C, T, G) constituting the alleles. Note that these quantities may be exactly zero, depending on the technique used for amplification. Also note that these quantities are typically measured from the intensity of signals from particular hybridization probes. This intensity measurement can be used instead of a measurement of quantity, or can be converted into a quantity estimate using standard techniques without changing the nature of the invention. Let $q_{s}$ be the sum of all the genetic material generated from all alleles of a particular SNP: $q_{s} = q_{s,A} + q_{s,G} + q_{s,T} + q_{s,C}$. Let N be the number of segments in a cell containing the SNP s. N is typically 2, but may be 0, 1 or 3 or more. For any high or medium throughput genotyping method discussed, the resulting quantity of genetic material can be represented as $q_{s} = (A + A_{\theta,s})N + \theta_{s}$ where A is the total amplification that is either estimated a-priori or easily measured empirically, $A_{\theta,s}$ is the error in the estimate of A for the SNP s, and $\theta_{s}$ is additive noise introduced in the amplification, hybridization and other processes for that SNP. The noise terms $A_{\theta,s}$ and $\theta_{s}$ are typically large enough that $q_{s}$ will not be a reliable measurement of N. However, the effects of these noise terms can be mitigated by measuring multiple SNPs on the chromosome. Let S be the number of SNPs that are measured on a particular chromosome, such as chromosome 21. It is possible to generate the average quantity of genetic material over all SNPs on a particular chromosome as follows:

$\begin{matrix}{q = \frac{1}{S}\sum\limits_{s = 1}^{S} q_{s} = NA + \frac{1}{S}\sum\limits_{s = 1}^{S}\left( A_{\theta,s}N + \theta_{s} \right)} & (16)\end{matrix}$

Assuming that $A_{\theta,s}$ and $\theta_{s}$ are normally distributed random variables with 0 means and variances $\sigma_{A_{\theta,s}}^{2}$ and $\sigma_{\theta_{s}}^{2}$, one can model $q = NA + \varphi$ where $\varphi$ is a normally distributed random variable with 0 mean and variance

$\frac{1}{S}\left( N^{2}\sigma_{A_{\theta,s}}^{2} + \sigma_{\theta}^{2} \right).$

Consequently, if a sufficient number of SNPs are measured on the chromosome such that $S \gg \left( N^{2}\sigma_{A_{\theta,s}}^{2} + \sigma_{\theta}^{2} \right)$, then $N = q/A$ can be accurately estimated.

In another embodiment, assume that the amplification is according to a model where the signal level from one SNP is $s = a + \alpha$ where $(a + \alpha)$ has a distribution that looks like the picture in FIG. 12A, left. The delta function at 0 models the rate of allele dropout of roughly 30%, the mean is a, and if there is no allele dropout, the amplification has a uniform distribution from 0 to $a_{0}$. In terms of the mean of this distribution, $a_{0}$ is found to be $a_{0} = 2.86a$. Now model the probability density function of $\alpha$ using the picture in FIG. 12B, right. Let $s_{c}$ be the signal arising from c loci; let n be the number of segments; let $\alpha_{i}$ be a random variable distributed according to FIGS. 12A and 12B that contributes to the signal from locus i; and let $\sigma$ be the standard deviation for all $\{\alpha_{i}\}$. Then $s_{c} = anc + \sum_{i = 1}^{nc}\alpha_{i}$; $\mathrm{mean}(s_{c}) = anc$; $\mathrm{std}(s_{c}) = \sqrt{nc}\,\sigma$. If $\sigma$ is computed according to the distribution in FIG. 12B, right, it is found that $\sigma^{2} = 0.907a^{2}$, that is, $\sigma \approx 0.95a$. We can find the number of segments from $n = s_{c}/(ac)$, and for “5-sigma statistics” we require $\mathrm{std}(n) < 0.1$, so $\mathrm{std}(s_{c})/(ac) = 0.1 \Rightarrow 0.95a\sqrt{nc}/(ac) = 0.1$, so $c = 0.95^{2}n/0.1^{2}$, which gives $c = 181$ for $n = 2$.
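The figures quoted for this additive model can be checked with a few lines of arithmetic. The sketch below assumes exactly the distribution described in the text (a point mass at zero with probability 0.3 and a uniform density on 0 to a₀ otherwise) and works in units where the mean amplification a equals 1. The function names are hypothetical.

```python
import math

def dropout_uniform_moments(mean_a, dropout=0.3):
    """Upper limit a0 and variance of a signal that is 0 with probability
    `dropout` and uniform on [0, a0] otherwise, given its overall mean."""
    a0 = 2.0 * mean_a / (1.0 - dropout)              # from mean = (1 - dropout) * a0 / 2
    second_moment = (1.0 - dropout) * a0 ** 2 / 3.0  # E(x^2) of uniform[0, a0] is a0^2 / 3
    variance = second_moment - mean_a ** 2
    return a0, variance

def loci_needed(n_segments, sigma_over_a, max_std=0.1):
    """Smallest c with std(n) = (sigma / a) * sqrt(n / c) <= max_std."""
    return math.ceil((sigma_over_a / max_std) ** 2 * n_segments)

a0, var = dropout_uniform_moments(1.0)   # units where a = 1
sigma = math.sqrt(var)
c = loci_needed(2, sigma)                # disomy, n = 2
```

Under these assumptions a₀ comes out near 2.86a and the variance near 0.905a², close to the 0.907a² quoted above, and the required number of loci for n = 2 is 181.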

Another model to estimate the confidence in the call, and how many loci or SNPs must be measured to ensure a given degree of confidence, incorporates the random variable as a multiplier of amplification instead of as an additive noise source, namely $s = a(1 + \alpha)$. Taking logs, $\log(s) = \log(a) + \log(1 + \alpha)$. Now, create a new random variable $\gamma = \log(1 + \alpha)$, and this variable may be assumed to be normally distributed $\sim N(0, \sigma)$. In this model, amplification can range from very small to very large, depending on $\sigma$, but is never negative. Therefore $\alpha = e^{\gamma} - 1$, and $s_{c} = \sum_{i = 1}^{cn} a(1 + \alpha_{i})$. For notation, $\mathrm{mean}(s_{c})$ and expectation value $E(s_{c})$ are used interchangeably:

$E(s_{c}) = acn + aE\left( \sum_{i = 1}^{cn}\alpha_{i} \right) = acn + acnE(\alpha) = acn\left( 1 + E(\alpha) \right)$

To find $E(\alpha)$, the probability density function (pdf) must be found for $\alpha$, which is possible since $\alpha$ is a function of $\gamma$, which has a known Gaussian pdf: $p_{\alpha}(\alpha) = p_{\gamma}(\gamma)(d\gamma/d\alpha)$. So:

$p_{\gamma}(\gamma) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-\gamma^{2}}{2\sigma^{2}}} \quad \text{and} \quad \frac{d\gamma}{d\alpha} = \frac{d}{d\alpha}\left( \log(1 + \alpha) \right) = \frac{1}{1 + \alpha} = e^{-\gamma}$

and:

$p_{\alpha}(\alpha) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-\gamma^{2}}{2\sigma^{2}}}e^{-\gamma} = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-\left( \log(1 + \alpha) \right)^{2}}{2\sigma^{2}}}\frac{1}{1 + \alpha}$

This has the form shown in FIG. 13 for $\sigma = 1$. Now, $E(\alpha)$ can be found by integrating over this pdf, $E(\alpha) = \int_{-\infty}^{\infty}\alpha\, p_{\alpha}(\alpha)\, d\alpha$, which can be done numerically for multiple different $\sigma$. This gives $E(s_{c})$ or $\mathrm{mean}(s_{c})$ as a function of $\sigma$. Now, this pdf can also be used to find $\mathrm{var}(s_{c})$:

$\begin{matrix}{{{var}\left( S_{C} \right)} = {{E\left( {S_{c} - {E\left( S_{C} \right)}} \right)}^{2} = {E\left( {{\Sigma_{i = {1\ldots \; {cn}}}{a\left( {1 + \alpha_{i}} \right)}} - {acn} - {{aE}\left( {\Sigma_{i = {1\ldots \; {cn}}}\alpha_{i}} \right)}} \right)}^{2}}} \\{= {E\left( {{\Sigma_{i = {1\ldots \; {cn}}}a\; \alpha_{i}} - {{aE}\left( {\Sigma_{i = {1\ldots \; {cn}}}\alpha_{i}} \right)}} \right)}^{2}} \\{= {a^{2}{E\left( {{\Sigma_{i = {1\ldots \; {cn}}}\alpha_{i}} - {{cnE}(\alpha)}} \right)}^{2}}} \\{= {a^{2}{E\left( {\left( {\Sigma_{i = {1\ldots \; {cn}}}\alpha_{i}} \right)^{2} - {2{{cnE}(\alpha)}\left( {\Sigma_{i = {1\ldots \; {cn}}}\alpha_{i}} \right)} + {c^{2}n^{2}{E(\alpha)}^{2}}} \right)}}} \\{= {a^{2}{E\left( {{{cn}\; \alpha^{2}} + {{{cn}\left( {{cn} - 1} \right)}\alpha_{i}\alpha_{j}} - {2{{cnE}(\alpha)}\left( {\Sigma_{i = {1\ldots \; {cn}}}\alpha_{i}} \right)} + {c^{2}n^{2}{E(\alpha)}^{2}}} \right)}}} \\{= {a^{2}c^{2}{n^{2}\left( {{E\left( \alpha^{2} \right)} + {\left( {{cn} - 1} \right){E\left( {\alpha_{i}\alpha_{j}} \right)}} - {2{{cnE}(\alpha)}^{2}} + {{cnE}(\alpha)}^{2}} \right)}}} \\{= {a^{2}c^{2}{n^{2}\left( {{E(\alpha)}^{2} + {\left( {{cn} - 1} \right){E\left( {\alpha_{i}\alpha_{j}} \right)}} - {{cnE}(\alpha)}^{2}} \right)}}}\end{matrix}$

which can also be solved numerically using $p_{\alpha}(\alpha)$ for multiple different $\sigma$ to get $\mathrm{var}(s_{c})$ as a function of $\sigma$. Then, we may take a series of measurements from a sample with a known number of loci c and a known number of segments n and find $\mathrm{std}(s_{c})/E(s_{c})$ from this data. That will enable us to compute a value for $\sigma$. In order to estimate n, note that $E(s_{c}) = nac(1 + E(\alpha))$, so

$\hat{n} = \frac{s_{c}}{ac\left( 1 + E(\alpha) \right)}$

can be measured so that

$\mathrm{std}\left( \hat{n} \right) = \frac{\mathrm{std}\left( s_{c} \right)}{ac\left( 1 + E(\alpha) \right)}$

When summing a sufficiently large number of independent random variables of 0-mean, the distribution approaches a Gaussian form, and thus $s_{c}$ (and $\hat{n}$) can be treated as normally distributed and as before we may use 5-sigma statistics:

$\mathrm{std}\left( \hat{n} \right) = \frac{\mathrm{std}\left( s_{c} \right)}{ac\left( 1 + E(\alpha) \right)} < 0.1$

in order to have an error probability of 2normcdf(−5,0,1)=5.7e-7. From this, one can solve for the number of loci c.
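The numerical steps for the multiplicative model can be sketched as follows. This is an illustration under explicit assumptions: the α_i are independent, so that E(α_iα_j) = E(α)² and var(s_c) reduces to a²·cn·var(α); σ is taken as the standard deviation of γ; and the moments of α are obtained by integrating over the Gaussian density of γ, which is equivalent to integrating over p_α(α) after the change of variables. The function names and the value σ = 0.5 are hypothetical.

```python
import math

def alpha_moments_numeric(sigma, steps=20000, span=10.0):
    """E(alpha) and var(alpha) for alpha = exp(gamma) - 1, gamma ~ N(0, sigma^2),
    computed by trapezoid integration over gamma."""
    lo, hi = -span * sigma, span * sigma
    h = (hi - lo) / steps
    m1 = m2 = 0.0
    for k in range(steps + 1):
        g = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0      # trapezoid end-point weights
        pdf = math.exp(-g * g / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
        alpha = math.exp(g) - 1.0
        m1 += w * alpha * pdf * h
        m2 += w * alpha ** 2 * pdf * h
    return m1, m2 - m1 ** 2

def loci_needed_multiplicative(n_segments, sigma, max_std=0.1):
    """Smallest c with std(n_hat) = sqrt(n / c) * std(alpha) / (1 + E(alpha)) <= max_std."""
    mean_alpha, var_alpha = alpha_moments_numeric(sigma)
    rel = math.sqrt(var_alpha) / (1.0 + mean_alpha)
    return math.ceil(n_segments * (rel / max_std) ** 2)

mean_alpha, var_alpha = alpha_moments_numeric(0.5)
c = loci_needed_multiplicative(2, 0.5)

# Two-sided probability of a Gaussian deviate exceeding 5 standard deviations.
tail = math.erfc(5.0 / math.sqrt(2.0))
```

Because 1 + α is log-normal under this assumption, the numerical moments can be cross-checked against the closed forms E(α) = exp(σ²/2) − 1 and var(α) = (exp(σ²) − 1)·exp(σ²). The two-sided 5-sigma tail probability evaluates to roughly 5.7e-7.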

Sexing

In one embodiment of the system, the genetic data can be used to determine the sex of the target individual. After the method disclosed herein is used to determine which segments of which chromosomes from the parents have contributed to the genetic material of the target, the sex of the target can be determined by checking to see which of the sex chromosomes has been inherited from the father: X indicates a female, and Y indicates a male. It should be obvious to one skilled in the art how to use this method to determine the sex of the target.

Validation of the Hypotheses

In some embodiments of the system, one drawback is that in order to make a prediction of the correct genetic state with the highest possible confidence, it is necessary to make hypotheses about every possible state. However, as the possible number of genetic states is exceptionally large, and computational time is limited, it may not be reasonable to test every hypothesis. In these cases, an alternative approach is to use the concept of hypothesis validation. This involves estimating limits on certain values, sets of values, properties or patterns that one might expect to observe in the measured data if a certain hypothesis, or class of hypotheses, is true. Then, the measured values can be tested to see if they fall within those expected limits, and/or certain expected properties or patterns can be tested for, and if the expectations are not met, then the algorithm can flag those measurements for further investigation.

For example, in a case where the end of one arm of a chromosome is broken off in the target DNA, the most likely hypothesis may be calculated to be “normal” (as opposed, for example, to “aneuploid”). This is because the particular hypothesis that corresponds to the true state of the genetic material, namely that one end of the chromosome has broken off, has not been tested, since the likelihood of that state is very low. If the concept of validation is used, then the algorithm will note that a high number of values, those that correspond to the alleles that lie on the broken off section of the chromosome, lie outside the expected limits of the measurements. A flag will be raised, inviting further investigation for this case, increasing the likelihood that the true state of the genetic material is uncovered.
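One simple way such a validation check might look is sketched below. It is only an illustration of the idea: the expected limits, the tolerated fraction of out-of-range measurements, the function name, and the example values are all assumptions, and a practical implementation could test richer properties or patterns than a single interval.

```python
def validate_hypothesis(values, lower, upper, max_outlier_fraction=0.05):
    """Return (flagged, outlier_indices) for measurements tested against the
    limits expected under the currently favored hypothesis."""
    outliers = [i for i, v in enumerate(values) if not (lower <= v <= upper)]
    fraction = len(outliers) / len(values)
    return fraction > max_outlier_fraction, outliers

# Hypothetical copy-number-like measurements along one chromosome. Under a
# "normal" (disomy) hypothesis, values are assumed to fall between 1.5 and 2.5;
# the last four loci sit on an arm segment that has broken off.
measurements = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 1.0, 0.9, 1.1, 1.0]
flagged, outliers = validate_hypothesis(measurements, 1.5, 2.5)
```

Here the hypothesis is flagged and the out-of-range loci are contiguous at one end of the chromosome, which is the kind of pattern that would prompt further investigation.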

It should be obvious to one skilled in the art how to modify the disclosed method to include the validation technique. Note that one anomaly that is expected to be very difficult to detect using the disclosed method is a balanced translocation.

M Notes

As noted previously, given the benefit of this disclosure, there are more embodiments that may implement one or more of the systems, methods, and features disclosed herein.

In all cases concerning the determination of the probability of a particular qualitative measurement on a target individual based on parent data, it should be obvious to one skilled in the art, after reading this disclosure, how to apply a similar method to determine the probability of a quantitative measurement of the target individual rather than qualitative. Wherever genetic data of the target or related individuals is treated qualitatively, it will be clear to one skilled in the art, after reading this disclosure, how to apply the techniques disclosed to quantitative data.

It should be obvious to one skilled in the art that a plurality of parameters may be changed without changing the essence of the invention. For example, the genetic data may be obtained using any high throughput genotyping platform, or it may be obtained from any genotyping method, or it may be simulated, inferred or otherwise known. A variety of computational languages could be used to encode the algorithms described in this disclosure, and a variety of computational platforms could be used to execute the calculations. For example, the calculations could be executed using personal computers, supercomputers, a massively parallel computing platform, or even non-silicon based computational platforms such as a sufficiently large number of people armed with abacuses.

Some of the math in this disclosure makes hypotheses concerning a limited number of states of aneuploidy. In some cases, for example, only monosomy, disomy and trisomy are explicitly treated by the math. It should be obvious to one skilled in the art how these mathematical derivations can be expanded to take into account other forms of aneuploidy, such as nullisomy (no chromosomes present), quadrosomy, etc., without changing the fundamental concepts of the invention.

When this disclosure discusses a chromosome, this may refer to a segment of a chromosome, and when a segment of a chromosome is discussed, this may refer to a full chromosome. It is important to note that the math to handle a segment of a chromosome is the same as that needed to handle a full chromosome. It should be obvious to one skilled in the art how to modify the method accordingly.

It should be obvious to one skilled in the art that a related individual may refer to any individual who is genetically related, and thus shares haplotype blocks with the target individual. Some examples of related individuals include: biological father, biological mother, son, daughter, brother, sister, half-brother, half-sister, grandfather, grandmother, uncle, aunt, nephew, niece, grandson, granddaughter, cousin, clone, the target individual himself/herself/itself, and other individuals with known genetic relationship to the target. The term ‘related individual’ also encompasses any embryo, fetus, sperm, egg, blastomere, blastocyst, or polar body derived from a related individual.

It is important to note that the target individual may refer to an adult, a juvenile, a fetus, an embryo, a blastocyst, a blastomere, a cell or set of cells from an individual, or from a cell line, or any set of genetic material. The target individual may be alive, dead, frozen, or in stasis.

It is also important to note that where the target individual refers to a blastomere that is used to diagnose an embryo, there may be cases caused by mosaicism where the genome of the blastomere analyzed does not correspond exactly to the genomes of all other cells in the embryo.

It is important to note that it is possible to use the method disclosed herein in the context of cancer genotyping and/or karyotyping, where one or more cancer cells is considered the target individual, and the non-cancerous tissue of the individual afflicted with cancer is considered to be the related individual. The non-cancerous tissue of the individual afflicted with cancer could provide the set of genotype calls of the related individual that would allow chromosome copy number determination of the cancerous cell or cells using the methods disclosed herein.

It is important to note that the method described herein concerns the cleaning of genetic data, and as all living or once living creatures contain genetic data, the methods are equally applicable to any live or dead human, animal, or plant that inherits or inherited chromosomes from other individuals.

It is important to note that in many cases, the algorithms described herein make use of prior probabilities, and/or initial values. In some cases the choice of these prior probabilities may have an impact on the efficiency and/or effectiveness of the algorithm. There are many ways that one skilled in the art, after reading this disclosure, could assign or estimate appropriate prior probabilities without changing the essential concept of the patent.

It is also important to note that the embryonic genetic data that can be generated by measuring the amplified DNA from one blastomere can be used for multiple purposes. For example, it can be used for detecting aneuploidy, uniparental disomy, sexing the individual, as well as for making a plurality of phenotypic predictions based on phenotype-associated alleles. Currently, in IVF laboratories, due to the techniques used, it is often the case that one blastomere can only provide enough genetic material to test for one disorder, such as aneuploidy, or a particular monogenic disease. Since the method disclosed herein has the common first step of measuring a large set of SNPs from a blastomere, regardless of the type of prediction to be made, a physician, parent, or other agent is not forced to choose a limited number of disorders for which to screen. Instead, the option exists to screen for as many genes and/or phenotypes as the state of medical knowledge will allow. With the disclosed method, one advantage to identifying particular conditions to screen for prior to genotyping the blastomere is that if it is decided that certain loci are especially relevant, then a more appropriate set of SNPs which are more likely to cosegregate with the locus of interest can be selected, thus increasing the confidence of the allele calls of interest.

It is also important to note that it is possible to perform haplotype phasing by molecular haplotyping methods. Because separation of the genetic material into haplotypes is challenging, most genotyping methods are only capable of measuring both haplotypes simultaneously, yielding diploid data. As a result, the sequence of each haploid genome cannot be deciphered. In the context of using the disclosed method to determine allele calls and/or chromosome copy number on a target genome, it is often helpful to know the maternal haplotype; however, it is not always simple to measure the maternal haplotype. One way to solve this problem is to measure haplotypes by sequencing single DNA molecules or clonal populations of DNA molecules. The basis for this method is to use any sequencing method to directly determine haplotype phase by direct sequencing of a single DNA molecule or clonal population of DNA molecules. This may include, but not be limited to: cloning amplified DNA fragments from a genome into recombinant DNA constructs and sequencing by traditional dye-end terminator methods, isolation and sequencing of single molecules in colonies, and direct single DNA molecule or clonal DNA population sequencing using next-generation sequencing methods.

The systems, methods, and techniques of the present invention may be used in conjunction with embryo screening or prenatal testing procedures. The systems, methods, and techniques of the present invention may be employed in methods of increasing the probability that the embryos and fetuses obtained by in vitro fertilization are successfully implanted and carried through the full gestation period. Further, the systems, methods, and techniques of the present invention may be employed in methods of decreasing the probability that the embryos and fetuses obtained by in vitro fertilization that are implanted and gestated are specifically at risk for a congenital disorder.

Thus, according to some embodiments, the present invention extends to the use of the systems, methods, and techniques of the invention in conjunction with pre-implantation diagnosis procedures.

According to some embodiments, the present invention extends to the use of the systems, methods, and techniques of the invention in conjunction with prenatal testing procedures.

According to some embodiments, the systems, methods, and techniques of the invention are used in methods to decrease the probability for the implantation of an embryo specifically at risk for a congenital disorder by testing at least one cell removed from early embryos conceived by in vitro fertilization and transferring to the mother's uterus only those embryos determined not to have inherited the congenital disorder.

According to some embodiments, the systems, methods, and techniques of the invention are used in methods to decrease the probability for the implantation of an embryo specifically at risk for a chromosome abnormality by testing at least one cell removed from early embryos conceived by in vitro fertilization and transferring to the mother's uterus only those embryos determined not to have chromosome abnormalities.

According to some embodiments, the systems, methods, and techniques of the invention are used in methods to increase the probability of implanting an embryo obtained by in vitro fertilization that is at a reduced risk of carrying a congenital disorder.

According to some embodiments, the systems, methods, and techniques of the invention are used in methods to increase the probability of gestating a fetus.

According to preferred embodiments, the congenital disorder is a malformation, neural tube defect, chromosome abnormality, Down's syndrome (or trisomy 21), Trisomy 18, spina bifida, cleft palate, Tay-Sachs disease, sickle cell anemia, thalassemia, cystic fibrosis, Huntington's disease, and/or fragile X syndrome. Chromosome abnormalities include, but are not limited to, Down syndrome (extra chromosome 21), Turner Syndrome (45X0) and Klinefelter's syndrome (a male with 2 X chromosomes).

According to preferred embodiments, the malformation is a limb malformation. Limb malformations include, but are not limited to, amelia, ectrodactyly, phocomelia, polymelia, polydactyly, syndactyly, polysyndactyly, oligodactyly, brachydactyly, achondroplasia, congenital aplasia or hypoplasia, amniotic band syndrome, and cleidocranial dysostosis.

According to preferred embodiments, the malformation is a congenital malformation of the heart. Congenital malformations of the heart include, but are not limited to, patent ductus arteriosus, atrial septal defect, ventricular septal defect, and tetralogy of Fallot.

According to preferred embodiments, the malformation is a congenital malformation of the nervous system. Congenital malformations of the nervous system include, but are not limited to, neural tube defects (e.g., spina bifida, meningocele, meningomyelocele, encephalocele and anencephaly), Arnold-Chiari malformation, the Dandy-Walker malformation, hydrocephalus, microencephaly, megencephaly, lissencephaly, polymicrogyria, holoprosencephaly, and agenesis of the corpus callosum.

According to preferred embodiments, the malformation is a congenital malformation of the gastrointestinal system. Congenital malformations of the gastrointestinal system include, but are not limited to, stenosis, atresia, and imperforate anus.

According to some embodiments, the systems, methods, and techniques of the invention are used in methods to increase the probability of implanting an embryo obtained by in vitro fertilization that is at a reduced risk of carrying a predisposition for a genetic disease.

According to preferred embodiments, the genetic disease is either monogenic or multigenic. Genetic diseases include, but are not limited to, Bloom Syndrome, Canavan Disease, Cystic fibrosis, Familial Dysautonomia, Riley-Day syndrome, Fanconi Anemia (Group C), Gaucher Disease, Glycogen storage disease 1a, Maple syrup urine disease, Mucolipidosis IV, Niemann-Pick Disease, Tay-Sachs disease, Beta thalassemia, Sickle cell anemia, Alpha thalassemia, Factor XI Deficiency, Friedreich's Ataxia, MCAD, Parkinson disease-juvenile, Connexin26, SMA, Rett syndrome, Phenylketonuria, Becker Muscular Dystrophy, Duchenne Muscular Dystrophy, Fragile X syndrome, Hemophilia A, Alzheimer dementia-early onset, Breast/Ovarian cancer, Colon cancer, Diabetes/MODY, Huntington disease, Myotonic Muscular Dystrophy, Parkinson Disease-early onset, Peutz-Jeghers syndrome, Polycystic Kidney Disease, and Torsion Dystonia.

Combinations of the Aspects of the Invention

As noted previously, given the benefit of this disclosure, there are more aspects and embodiments that may implement one or more of the systems, methods, and features disclosed herein. Below is a short list of examples illustrating situations in which the various aspects of the disclosed invention can be combined in a plurality of ways. It is important to note that this list is not meant to be comprehensive; many other combinations of the aspects, methods, features and embodiments of this invention are possible.

In one embodiment of the invention, it is possible to combine several of the aspects of the invention such that one could perform both allele calling and aneuploidy calling in one step, and to use quantitative values instead of qualitative for both parts. It should be obvious to one skilled in the art how to combine the relevant mathematics without changing the essence of the invention.

In a preferred embodiment of the invention, the disclosed method is employed to determine the genetic state of one or more embryos for the purpose of embryo selection in the context of IVF. This may include the harvesting of eggs from the prospective mother and fertilizing those eggs with sperm from the prospective father to create one or more embryos. It may involve performing embryo biopsy to isolate a blastomere from each of the embryos. It may involve amplifying and genotyping the genetic data from each of the blastomeres. It may include obtaining, amplifying and genotyping a sample of diploid genetic material from each of the parents, as well as one or more individual sperm from the father. It may involve incorporating the measured diploid and haploid data of both the mother and the father, along with the measured genetic data of the embryo of interest into a dataset. It may involve using one or more of the statistical methods disclosed in this patent to determine the most likely state of the genetic material in the embryo given the measured or determined genetic data. It may involve the determination of the ploidy state of the embryo of interest. It may involve the determination of the presence of a plurality of known disease-linked alleles in the genome of the embryo. It may involve making phenotypic predictions about the embryo. It may involve generating a report that is sent to the physician of the couple so that they may make an informed decision about which embryo(s) to transfer to the prospective mother.

Another example could be a situation where a 44-year old woman undergoing IVF is having trouble conceiving. The couple arranges to have her eggs harvested and fertilized with sperm from the man, producing nine viable embryos. A blastomere is harvested from each embryo, and the genetic data from the blastomeres are measured using an ILLUMINA INFINIUM BEAD ARRAY. Meanwhile, the diploid data are measured from tissue taken from both parents also using the ILLUMINA INFINIUM BEAD ARRAY. Haploid data from the father's sperm are measured using the same method. The method disclosed herein is applied to the genetic data of the blastomere and the diploid maternal genetic data to phase the maternal genetic data to provide the maternal haplotype. Those data are then incorporated, along with the father's diploid and haploid data, to allow a highly accurate determination of the copy number count for each of the chromosomes in each of the embryos. Eight of the nine embryos are found to be aneuploid, and the remaining embryo is found to be euploid. A report is generated that discloses these diagnoses, and is sent to the doctor. The report has data similar to the data found in Tables 9, 10 and 11. The doctor, along with the prospective parents, decides to transfer the euploid embryo, which implants in the mother's uterus.

Another example may involve a woman who has been artificially inseminated by a sperm donor and is pregnant. She wants to minimize the risk that the fetus she is carrying has a genetic disease. She undergoes amniocentesis and fetal cells are isolated from the withdrawn sample, and a tissue sample is also collected from the mother. Since there are no other embryos, her data are phased using molecular haplotyping methods. The genetic material from the fetus and from the mother are amplified as appropriate and genotyped using the ILLUMINA INFINIUM BEAD ARRAY, and the methods described herein reconstruct the embryonic genotype as accurately as possible. Phenotypic susceptibilities are predicted from the reconstructed fetal genetic data and a report is generated and sent to the mother's physician so that they can decide what actions may be best.

Another example could be a situation where a racehorse breeder wants to increase the likelihood that the foals sired by his champion racehorse become champions themselves. He arranges for the desired mare to be impregnated by IVF, and uses genetic data from the stallion and the mare to clean the genetic data measured from the viable embryos. The cleaned embryonic genetic data allows the breeder to select the embryos for implantation that are most likely to produce a desirable racehorse.

Tables 1-11

Table 1. Probability distribution of measured allele calls given the true genotype.

Table 2. Probabilities of specific allele calls in the embryo using the U and H notation.

Table 3. Conditional probabilities of specific allele calls in the embryo given all possible parental states.

Table 4. Constraint Matrix (A).

Table 5. Notation for the counts of observations of all specific embryonic allelic states given all possible parental states.

Table 6. Aneuploidy states (h) and corresponding P(h|n_(j)), the conditional probabilities given the copy numbers.

Table 7. Probability of aneuploidy hypothesis (H) conditional on parent genotype.

Table 8. Results of PS algorithm applied to 69 SNPs on chromosome 7.

Table 9. Aneuploidy calls on eight known euploid cells.

Table 10. Aneuploidy calls on ten known trisomic cells.

Table 11. Aneuploidy calls for six blastomeres.

TABLE 1 Probability distribution of measured allele calls given the true genotype. p(dropout) = 0.5, p(gain) = 0.02

          measured
true      AA      AB      BB      XX
AA        0.735   0.015   0.005   0.245
AB        0.250   0.250   0.250   0.250
BB        0.005   0.015   0.735   0.245
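As an illustration of how the entries of Table 1 may be used, the following minimal sketch treats each row as the likelihood of a measured call given the true genotype and applies Bayes' rule to obtain the probability of each true genotype given one measured call. The function name `posterior` and the uniform prior over AA, AB and BB are assumptions made for this illustration only; they are not taken from the disclosure, and in practice the prior may instead come from parental genotypes or population frequencies.

```python
# Table 1 read as a likelihood table: TABLE_1[true][measured] = P(measured | true).
TABLE_1 = {
    "AA": {"AA": 0.735, "AB": 0.015, "BB": 0.005, "XX": 0.245},
    "AB": {"AA": 0.250, "AB": 0.250, "BB": 0.250, "XX": 0.250},
    "BB": {"AA": 0.005, "AB": 0.015, "BB": 0.735, "XX": 0.245},
}


def posterior(measured, prior=None):
    """Return P(true genotype | measured call) by Bayes' rule.

    `prior` maps each true genotype to its prior probability. The default
    uniform prior is an assumption of this sketch, not part of Table 1.
    """
    if prior is None:
        prior = {genotype: 1.0 / len(TABLE_1) for genotype in TABLE_1}
    # Joint probability of each true genotype together with the observed call.
    joint = {g: prior[g] * TABLE_1[g][measured] for g in TABLE_1}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


if __name__ == "__main__":
    for call in ("AA", "AB", "BB", "XX"):
        print(call, posterior(call))
```

Under the uniform prior, a measured AA call leaves appreciable probability on a true AB genotype, because allele dropout converts AB to AA one quarter of the time in Table 1.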

TABLE 2 Probabilities of specific allele calls in the embryo using the U and H notation.

Embryo         Embryo readouts
truth state    U      H      Ū      empty
U              p₁₁    p₁₂    p₁₃    p₁₄
H              p₂₁    p₂₂    p₂₃    p₂₄

TABLE 3 Conditional probabilities of specific allele calls in the embryo given all possible parental states.

Parental    Expected truth state     Embryo readout types and conditional probabilities
matings     in the embryo            U      H      Ū      empty
UxU         U                        p₁₁    p₁₂    p₁₃    p₁₄
UxŪ         H                        p₂₁    p₂₂    p₂₃    p₂₄
UxH         50% U, 50% H             p₃₁    p₃₂    p₃₃    p₃₄
HxH         25% U, 25% Ū, 50% H      p₄₁    p₄₂    p₄₃    p₄₄

TABLE 4 Constraint Matrix (A). 1 1 1 1 1 1 1 1 1 −1 −.5 −.5 1 −.5 −.5 1 −.5 −.5 1 −.5 −.5 1 −.25 −.25 −.5 1 −.5 −.5 1 −.25 −.25 −.5 1 −.5 −.5 1

TABLE 5 Notation for the counts of observations of all specific embryonic allelic states given all possible parental states.

Parental    Expected embryo          Embryo readout types and observed counts
matings     truth state              U      H      Ū      Empty
UxU         U                        n₁₁    n₁₂    n₁₃    n₁₄
UxŪ         H                        n₂₁    n₂₂    n₂₃    n₂₄
UxH         50% U, 50% H             n₃₁    n₃₂    n₃₃    n₃₄
HxH         25% U, 25% Ū, 50% H      n₄₁    n₄₂    n₄₃    n₄₄

TABLE 6 Aneuploidy states (h) and corresponding P(h|n_(j)), the conditional probabilities given the copy numbers.

N    H                      P(h|n)     In General
1    paternal monosomy      0.5        Ppm
1    maternal monosomy      0.5        Pmm
2    Disomy                 1          1
3    paternal trisomy t1    0.5*pt1    ppt*pt1
3    paternal trisomy t2    0.5*pt2    ppt*pt2
3    maternal trisomy t1    0.5*pm1    pmt*mt1
3    maternal trisomy t2    0.5*pm2    pmt*mt2

TABLE 7 Probability of aneuploidy hypothesis (H) conditional on parent genotype.

embryo allele counts    hypothesis     (mother, father) genotype
copy #   nA   nC        H              AA,AA  AA,AC  AA,CC  AC,AA  AC,AC  AC,CC  CC,AA  CC,AC  CC,CC
1        1    0         father only    1      1      1      0.5    0.5    0.5    0      0      0
1        1    0         mother only    1      0.5    0      1      0.5    0      1      0.5    0
1        0    1         father only    0      0      0      0      0.5    0.5    1      1      1
1        0    1         mother only    0      0.5    1      0.5    0.5    1      0      0.5    1
2        2    0         disomy         1      0.5    0      0.5    0.25   0      0      0      0
2        1    1         disomy         0      0.5    1      0.5    0.5    0.5    1      0.5    0
2        0    2         disomy         0      0      0      0      0.25   0.5    0      0.5    1
3        3    0         father t1      1      0.5    0      0      0      0      0      0      0
3        3    0         father t2      1      0.5    0      0.5    0.25   0      0      0      0
3        3    0         mother t1      1      0      0      0.5    0      0      0      0      0
3        3    0         mother t2      1      0.5    0      0.5    0.25   0      0      0      0
3        2    1         father t1      0      0.5    1      1      0.5    0      0      0      0
3        2    1         father t2      0      0.5    1      0      0.25   0.5    0      0      0
3        2    1         mother t1      0      1      0      0.5    0.5    0      1      0      0
3        2    1         mother t2      0      0      0      0.5    0.25   0      1      0.5    0
3        1    2         father t1      0      0      0      0      0.5    1      1      0.5    0
3        1    2         father t2      0      0      0      0.5    0.25   0      0      0.5    0
3        1    2         mother t1      0      0      1      0      0.5    0.5    0      1      0
3        1    2         mother t2      0      0.5    1      0      0.25   0.5    0      0      0
3        0    3         father t1      0      0      0      0      0      0      0      0.5    1
3        0    3         father t2      0      0      0      0      0.25   0.5    0      0.5    1
3        0    3         mother t1      0      0      0      0      0      0.5    0      0      1
3        0    3         mother t2      0      0      0      0      0.25   0.5    0      0.5    1
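The three disomy rows of Table 7 can be reproduced from Mendelian inheritance alone. The following sketch assumes that, under disomy, each parent independently transmits one of its two alleles with equal probability; the helper names `allele_prob` and `disomy_row` are hypothetical and appear nowhere in the disclosure. The monosomy and trisomy rows depend on additional conventions (the contributing parent and the t1/t2 trisomy types) and are not modeled here.

```python
from itertools import product

# Parental genotypes in the column order used by Table 7:
# (mother, father) = (AA, AA), (AA, AC), (AA, CC), (AC, AA), ..., (CC, CC).
GENOTYPES = ["AA", "AC", "CC"]


def allele_prob(genotype, allele):
    """Probability that a parent with `genotype` transmits `allele`."""
    return genotype.count(allele) / 2


def disomy_row(n_a):
    """P(embryo carries n_a copies of A and 2 - n_a copies of C | disomy).

    Returns nine values, one per (mother, father) genotype pair.
    """
    row = []
    for mother, father in product(GENOTYPES, repeat=2):
        m_a = allele_prob(mother, "A")
        f_a = allele_prob(father, "A")
        if n_a == 2:
            p = m_a * f_a
        elif n_a == 1:
            p = m_a * (1 - f_a) + (1 - m_a) * f_a
        else:
            p = (1 - m_a) * (1 - f_a)
        row.append(p)
    return row


if __name__ == "__main__":
    for n_a in (2, 1, 0):
        print(f"nA={n_a}, nC={2 - n_a}:", disomy_row(n_a))
```

The three printed rows agree with the disomy rows of Table 7 for allele counts (2, 0), (1, 1) and (0, 2).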

TABLE 8 Results of PS algorithm applied to 69 SNPs on chromosome 7

probe id  Sup id         p1  p2  m1  m2  b11 b12 b21 b22 b31 b32 h1  h2  h3  e1  e2  T1  T2  E1  E2  conf
80        C_2972977_10   A   A   A   A   A   A   A   A   X   X   A   A   X   A   A   A   A   A   A   0.9983
81        C_2972981_10   T   T   T   T   X   X   X   X   X   X   X   X   X   T   T   T   T   T   T   0.9775
98        C_11611980_10  G   A   A   A   A   A   G   G   X   X   G   X   X   —   —   G   A           0.9890
26        C_2280961_1_   G   C   G   C   X   X   X   X   X   X   X   X   X   —   —   G   C   —   —   —
8         C_341386_1_    C   C   C   C   X   X   C   C   C   C   C   X   C   C   C   C   C   C   C   0.9988
9         C_341387_1_    C   C   C   C   C   C   C   C   C   C   X   X   C   C   C   C   C   C   C   0.9982
71        C_2606775_1_   G   G   G   G   X   X   X   X   X   X   X   X   X   G   G   G   G   G   G   0.9851
72        C_2606779_1_   G   G   G   G   X   X   G   G   G   G   X   G   X   G   G   G   G   G   G   0.9972
27        C_2280966_10   T   A   T   A   X   X   X   X   X   X   X   X   X   —   —   T   A           0.9211
73        C_2606790_10   T   C   T   C   X   X   T   T   X   X   X   X   X   —   —   T   C   —   —   —
105       C_22273192_10  G   G   G   A   X   X   G   G   G   A   G   G   X   —   —   G   A           0.9917
106       C_25619317_10  A   A   A   A   X   X   A   A   X   X   X   X   X   A   A   A   A   A   A   0.9940
64        C_2559556_10   G   G   T   G   X   X   T   T   X   X   G   G   X   —   —   T   G           0.9798
20        C_2258563_20   T   T   T   C   C   C   X   X   X   X   T   X   T   —   —   T   C           0.9324
49        C_2546376_1_   A   A   G   A   G   G   X   X   G   A   A   X   A   —   —   G   A           0.9887
50        C_2546377_20   G   A   G   A   G   G   X   X   G   A   X   X   A   G   A   G   A   G   A   0.9668
21        C_2258567_10   T   C   T   C   X   X   X   X   X   X   X   C   X   T   C   T   C   T   C   0.8924
52        C_2546385_10   T   T   T   A   X   X   A   A   X   X   T   X   T   —   —   T   A           0.9283
53        C_2546435_1_   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   0.9994
104       C_16163603_10  T   C   T   C   T   C   X   X   X   X   X   X   X   T   C   T   C   T   C   0.9840
92        C_8966543_10   A   A   G   A   A   A   G   G   X   X   X   A   X   —   —   G   A           0.8622
96        C_11438986_10  G   A   G   A   X   X   G   A   G   A   X   G   G   G   A   G   A   G   A   0.9977
82        C_2982699_10   A   A   G   A   A   A   A   A   X   X   A   A   A   A   A   A   A   A   A   0.9292
99        C_15796183_10  A   A   C   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   0.9705
43        C_2546319_10   T   T   T   T   T   T   T   T   T   T   X   T   X   T   T   T   T   T   T   0.9992
44        C_2546335_20   T   C   T   T   T   T   X   X   X   X   X   T   X   T   T   T   T   T   T   0.8508
45        C_2546344_10   T   T   T   T   X   X   T   T   T   T   T   T   X   T   T   T   T   T   T   0.9961
46        C_2546353_1_   T   T   T   C   X   X   T   T   X   X   T   T   T   —   —   T   C           0.9346
56        C_2555662_10   T   C   T   C   X   X   C   C   X   X   C   C   C   —   —   T   C           0.9328
57        C_2555670_10   T   C   T   T   T   T   T   T   T   C   T   X   T   —   —   T   C           0.9604
58        C_2555685_10   T   C   T   T   T   T   T   T   T   T   T   C   C   T   T   T   T   T   T   0.9982
59        C_2555706_10   A   A   G   A   A   A   X   X   A   A   X   G   G   A   A   A   A   A   A   0.9345
15        C_1843560_10   T   C   T   C   X   X   X   X   X   X   C   C   T   —   —   T   C           0.9683
102       C_16151234_10  G   A   G   G   X   X   X   X   X   X   A   X   G   —   —   G   A           0.9944
94        C_11436903_10  G   C   C   C   C   C   G   C   G   C   X   C   C   —   —   G   C           0.9991
18        C_2256696_20   G   A   G   G   G   G   G   G   G   G   A   A   X   G   G   G   G   G   G   0.9638
7         C_328336_10    T   T   T   T   T   T   X   X   X   X   X   X   T   T   T   T   T   T   T   0.9927
37        C_2543108_10   T   G   G   G   G   X   X   G   G   G   G   G   X   T   G   G   G   G   G   0.9976
107       C_25632606_10  A   A   A   A   A   A   X   X   A   A   A   X   A   A   A   A   A   A   A   0.9984
38        C_2543111_10   T   C   C   C   C   C   X   X   C   C   C   X   X   C   C   C   C   C   C   0.9948
40        C_2543116_10   T   C   T   T   X   X   T   T   T   T   T   C   C   T   T   T   T   T   T   0.9977
86        C_8852708_10   T   C   C   C   X   X   C   C   C   C   X   X   X   C   C   C   C   C   C   0.9317
67        C_2602203_10   A   A   C   A   A   A   X   X   A   A   X   X   X   A   A   A   A   A   A   0.9187
24        C_2279233_10   T   C   C   C   C   C   X   X   C   C   X   T   T   C   C   C   C   C   C   0.9968
68        C_2602208_10   A   A   A   A   A   A   X   X   X   X   X   X   X   A   A   A   A   A   A   0.9937
69        C_2602221_10   T   G   T   T   X   X   X   X   X   X   X   X   X   T   T   T   T   T   T   0.9810
14        C_656774_1_    G   A   A   A   X   X   A   A   X   X   X   X   X   A   A   A   A   A   A   0.8485
85        C_3021372_10   G   A   G   G   G   G   G   G   X   X   G   A   A   G   G   G   G   G   G   0.9973
83        C_3021345_10   A   A   A   A   X   X   A   A   X   X   X   X   X   A   A   A   A   A   A   0.9955
12        C_656644_20    C   C   C   C   C   C   C   C   C   C   X   X   X   C   C   C   C   C   C   0.9992
11        C_656642_1_    A   A   G   G   G   G   G   G   G   G   X   A   X   G   G   G   G   G   G   0.7858
61        C_2558137_10   T   C   T   T   T   T   X   X   T   T   C   C   C   T   T   T   T   T   T   0.9957
87        C_8853467_10   G   G   G   G   X   X   X   X   X   X   X   X   X   G   G   G   G   G   G   0.9866
3         C_17027_10     T   T   T   T   T   T   T   T   X   X   T   T   T   T   T   T   T   T   T   0.9959
28        C_2540863_10   T   C   C   C   T   T   T   T   X   X   X   X   X   —   —   T   C           0.9561
31        C_2540896_1_   T   C   T   T   X   X   X   X   X   X   X   X   X   —   —   T   C           0.9058
35        C_2540935_10   C   C   C   C   X   X   X   X   C   C   X   X   X   C   C   C   C   C   C   0.9944
36        C_2540940_10   T   G   T   G   T   T   X   X   T   G   T   X   T   —   —   T   G           0.9828
109       C_25632779_10  A   A   G   A   X   X   A   A   A   A   A   A   A   A   A   A   A   A   A   0.9269
4         C_75266_10     T   C   C   C   X   X   X   X   X   X   X   X   X   —   —   T   C           0.9954
76        C_2668636_10   A   A   G   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   0.9689
77        C_2668640_10   T   T   G   G   T   T   T   T   T   T   X   X   X   T   T   T   T   T   T   0.6498
22        C_2259838_10   G   C   G   G   G   G   X   X   G   C   G   G   G   G   C   G   C   G   C   0.9988
23        C_2259850_10   A   A   G   A   X   X   A   A   A   A   X   X   A   A   A   A   A   A   A   0.9289
10        C_349428_10    A   A   A   A   A   A   A   A   A   A   X   A   X   A   A   A   A   A   A   0.9974
5         C_321446_10    T   T   T   G   X   X   T   T   X   X   T   T   T   T   T   T   T   T   T   0.8457
42        C_2545620_10   A   A   A   A   A   A   X   X   A   A   X   X   X   A   A   A   A   A   A   0.9944
19        C_2258307_10   G   G   G   A   G   G   G   G   G   G   X   X   X   G   G   G   G   G   G   0.9684
93        C_8853956_10   G   G   G   A   X   X   G   G   X   X   G   X   X   G   G   G   G   G   G   0.8460

TABLE 9 Aneuploidy calls on eight known euploid cells

Chr #  Cell 1     Cell 2     Cell 3     Cell 4     Cell 5     Cell 6     Cell 7     Cell 8
1      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
2      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
3      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
4      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
5      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
6      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
7      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
8      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
9      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
10     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
11     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
12     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
13     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
14     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
15     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
16     2 1.00000  2 0.99997  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
17     2 1.00000  2 0.99995  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
18     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
19     2 1.00000  2 0.99998  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
20     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
21     2 0.99993  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
22     2 1.00000  2 1.00000  2 0.99040  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 0.99992
X      2 0.99999  2 0.99994  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000

TABLE 10 Aneuploidy calls on ten known trisomic cells

Chr #  Cell 1     Cell 2     Cell 3     Cell 4     Cell 5     Cell 6     Cell 7     Cell 8     Cell 9     Cell 10
1      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
2      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
3      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
4      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
5      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
6      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
7      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
8      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
9      2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
10     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
11     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
12     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
13     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
14     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
15     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
16     2 1.00000  2 0.99997  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
17     2 1.00000  2 0.99995  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
18     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
19     2 1.00000  2 0.99997  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
20     2 1.00000  2 0.99995  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
21     — 1.00000  — 1.00000  — 1.00000  — 1.00000  — 1.00000  — 1.00000  — 1.00000  — 1.00000  — 1.00000  — 1.00000
22     2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000  2 1.00000
23     1 1.00000  1 1.00000  1 1.00000  1 1.00000  1 1.00000  1 1.00000  1 1.00000  1 1.00000  1 1.00000  1 1.00000

TABLE 11 Aneuploidy calls for six blastomeres

Chr #  e1b1       e1b3       e1b6       e2b1       e2b2       e3b2
1      2 1.00000  2 1.00000  1 1.00000  1 1.00000  1 1.00000  3 1.00000
2      2 1.00000  2 1.00000  3 1.00000  1 1.00000  1 1.00000  2 0.99994
3      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 1.00000
4      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 1.00000
5      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 0.99964
6      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 1.00000
7      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  2 0.99866
8      2 1.00000  2 1.00000  3 0.99966  1 1.00000  1 1.00000  3 1.00000
9      2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  3 0.99999
10     2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  1 1.00000
11     2 1.00000  2 1.00000  3 1.00000  1 1.00000  1 1.00000  2 0.99931
12     2 1.00000  2 1.00000  2 1.00000  1 1.00000  1 1.00000  1 1.00000
13     2 1.00000  2 1.00000  3 0.98902  1 1.00000  1 1.00000  2 0.99969
14     2 1.00000  2 1.00000  2 0.99991  1 1.00000  1 1.00000  3 1.00000
15     2 1.00000  2 1.00000  2 0.99986  1 1.00000  1 1.00000  3 0.99999
16     2 1.00000  3 0.98609  2 0.74890  1 1.00000  1 1.00000  2 0.94126
17     2 1.00000  2 1.00000  2 0.97983  1 1.00000  1 1.00000  2 1.00000
18     2 1.00000  2 1.00000  2 0.98367  1 1.00000  1 1.00000  1 1.00000
19     2 1.00000  2 1.00000  4 0.64546  1 1.00000  1 1.00000  3 1.00000
20     2 1.00000  2 1.00000  3 0.58327  1 1.00000  1 1.00000  2 0.95078
21     2 0.99952  2 1.00000  2 0.97594  1 1.00000  1 1.00000  1 0.99776
22     2 1.00000  2 0.98219  2 0.99217  1 1.00000  1 0.99989  2 1.00000
23     2 1.00000  3 1.00000  3 1.00000  1 1.00000  1 1.00000  3 0.99998
24 1 0.99122 1 0.99778 1 0.99999
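By way of illustration, per-chromosome calls such as those in Table 11 may be condensed into a single status for each blastomere. The sketch below is one possible summary and not the reporting method of the disclosure: the function name `summarize`, the three status labels, and the confidence threshold of 0.99 are all assumptions chosen for this example. The autosomal calls for blastomeres e1b1 and e2b1 (chromosomes 1 through 22) are taken from Table 11.

```python
def summarize(calls, min_conf=0.99):
    """Summarize autosomal copy-number calls for one blastomere.

    `calls` maps chromosome number -> (copy_number, confidence).
    `min_conf` is a hypothetical threshold, not a value from the disclosure.
    Returns (status, abnormal_chromosomes, low_confidence_chromosomes).
    """
    abnormal = [c for c, (n, _) in sorted(calls.items()) if n != 2]
    uncertain = [c for c, (_, p) in sorted(calls.items()) if p < min_conf]
    if abnormal:
        status = "aneuploid"
    elif uncertain:
        status = "uncertain"
    else:
        status = "euploid"
    return status, abnormal, uncertain


# Autosomal calls (chromosomes 1-22) for two blastomeres from Table 11.
e1b1 = {c: (2, 1.0) for c in range(1, 23)}
e1b1[21] = (2, 0.99952)
e2b1 = {c: (1, 1.0) for c in range(1, 23)}

if __name__ == "__main__":
    print("e1b1:", summarize(e1b1))
    print("e2b1:", summarize(e2b1))
```

With these inputs, e1b1 is summarized as euploid across the autosomes, while e2b1, called at one copy on every autosome, is summarized as aneuploid.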

What is claimed is:
 1. A method for determining genetic data for DNA from a first individual in a biological sample of a second individual, the method comprising: amplifying a plurality of target loci on cell-free DNA extracted from the biological sample to generate amplified products; sequencing the amplified products by sequencing-by-synthesis to obtain genetic data of the plurality of target loci; determining the most likely genetic data for DNA from the first individual based on allele frequencies in the genetic data at the plurality of target loci.
 2. The method of claim 1, wherein the biological sample is a blood sample.
 3. The method of claim 1, wherein the target loci are SNP loci.
 4. The method of claim 1, wherein the target loci are on a plurality of chromosomes.
 5. The method of claim 1, wherein the amplifying comprises targeted PCR.
 6. The method of claim 1, wherein the amplifying comprises targeted PCR and universal PCR.
 7. The method of claim 1, wherein the sequencing-by-synthesis comprises clonal amplification of the amplified products and measurement of sequences of the clonally amplified DNA.
 8. The method of claim 1, wherein the genetic data produced by sequencing is noisy and comprises allele drop out errors.
 9. The method of claim 1, wherein the genetic data produced by sequencing is noisy and comprises measurement bias.
 10. The method of claim 1, wherein the genetic data produced by sequencing is noisy and comprises incorrect measurements.
 11. The method of claim 1, further comprising normalizing the genetic data for differences in amplification and/or sequencing efficiency between the target loci.
 12. The method of claim 3, wherein the confidence that each SNP is correctly called is at least 95%.
 13. The method of claim 3, wherein the confidence that each SNP is correctly called is at least 99%.
 14. The method of claim 1, wherein the target loci comprise at least 70 SNPs.
 15. The method of claim 1, wherein the target loci comprise at least 131 SNPs.
 16. The method of claim 1, wherein the target loci comprise at least 384 SNPs.
 17. The method of claim 1, wherein the target loci comprise at least 500 SNPs.
 18. A method for determining genetic data for DNA from a first individual in a blood sample of a second individual, the method comprising: performing targeted PCR to amplify a plurality of SNP loci on cell-free DNA extracted from the blood sample to generate amplified products, wherein the SNP loci are on a plurality of chromosomes; sequencing the amplified products by sequencing-by-synthesis to obtain genetic data of the plurality of SNP loci, wherein the sequencing-by-synthesis comprises clonal amplification of the amplified products and measurement of sequences of the clonally amplified DNA; determining the most likely genetic data for DNA from the first individual based on allele frequencies in the genetic data at the plurality of SNP loci.
 19. The method of claim 18, wherein the confidence that each SNP is correctly called is at least 95%.
 20. The method of claim 18, wherein the confidence that each SNP is correctly called is at least 99%.